CKME 136 - Capstone Project [Farrukh Aziz]

Introduction

This project aims to cluster Yelp Dataset restaurants from 10 metropolitan cities of North America in contiguous groups of geo-spatial locations. Then the insight into the interest of customers that review restaurants of a particular cluster is used to indicate supply and demand proportions of various categories of restaunts.

Summary

This project aims to detect Supply/Demand patterns of restaurants with various categories of food based on customer reviews of the restaurants. The aim is to predict the category of a restaurant (classification) simply based on the location and interests of the customers. However, it is not a straighforward merging of the customers, reviews and businesses data to try various classifiers and predict the categories. The customer reviews data must be conditioned to respresent their interest in particular categories of food in individual localities of restaurant businesses to acheive classification, which can be further extrapolated to Supply vs. Demand paradigm. Here are the steps that are taken:

  1. Pick top 20 categories of food that restaurants offer as focus for analysis
  2. Use unsupervised machine learning algorithms (DBSCAN) to geospatially cluster restaurants
  3. For each customer, aggregate their reviews for restaurants of each category, treating it as their intrest or demand for that type of restaurant.
  4. For each cluster, group restaurants by category to calculate aggregate supply of restaurants of each category
  5. For each cluster, aggregate demand by summing up restaurant demands calculated in step 3 specific to customers of the that cluster of restaurants only
  6. Merge aggregated data for each cluster to individual restaurants
  7. Run Classification algorithms to find accuracy of category prediction for each category
  8. Add new unlabbled restaurants to each cluster by supply/demand ratio, predict their categories and indicate ratio on the map
  9. Validate clusters in a known area to see supply/demand of individual categories
    ### Index Please click links to jump to the specific area:

    Import Libraries
    Attribute Analysis - For Clustering
    Attribute Selection - For Clustering
    Data Clustering
    User Data Preparation
    Mapping Data Preparation
    Classification Model Evaluation
    Data Visualization

Please click links below to access interactive diagrams and maps:
DBSCAN Min. Neighbors & Distance vs Coverage
DBSCAN Min. Neighbors & Distance vs Cluster Count
DBSCAN Min. Neighbors & Distance vs Largest Cluster Size
DBSCAN Label Count Histogram for Min. Neighbors & Distance
DBSCAN Min. Neighbors & Distance vs Coverage

North America Clustered Restaurants by Location (All Categories)
All Clustered Restaurants on Sketch (Toronto)
All Clustered Restaurants on Map (Toronto)
Slider Controlled Categories Displaying Demand (Toronto)



Import Libraries

Import required libraries

For some of the thirs party libraries, you may have to run 'pip install' commands e.g.

  • pip install geopy
  • pip install shaply
  • pip install matplotlib
  • pip install plotly
  • pip install cufflinks
  • pip install swifter
In [903]:
#pk.eyJ1IjoiZjhheml6IiwiYSI6ImNqb3plOWp6MjA0bXIzcnFxczZ1bjdrbmwifQ.5qd5W4B06UUZc20Jax12OA
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time, plotly.plotly as py, plotly.graph_objs as go
from joblib import Parallel, delayed
from ipywidgets import FloatProgress
import matplotlib.pyplot as plt
import multiprocessing
from IPython.core.display import display, HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
from plotly import tools
import cufflinks as cf
from collections import Counter
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

init_notebook_mode(connected=True)

def __progressbar(ticks):
    __bar = FloatProgress(min=0, max=len(ret_series))
    display(__bar)
    return __bar


Read Data

Read data from the file

The following files from Yelp Dataset will be used:

  1. yelp_academic_dataset_business.json : Business location coordinates, locations and categories etc.
  2. yelp_academic_dataset_review.json : User reviews, related business ids etc.

Rest of the data from the dataset is not useful for the purposes of this project.

In [891]:
business_file = "yelp_academic_dataset_business.json"
review_file = "yelp_academic_dataset_review.json"

start_time = time.time()

df_business_data_full = pd.read_json(business_file, lines=True)
df_review_data_full = pd.read_json(review_file, lines=True)

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Time taken: 107.87 seconds


Attribute Analysis

Reveal datatypes and sample data rows from the DataSet

1. Business data datatypes and attributes

In [892]:
df_business_data_full.info()
df_business_data_full.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188593 entries, 0 to 188592
Data columns (total 15 columns):
address         188593 non-null object
attributes      162807 non-null object
business_id     188593 non-null object
categories      188052 non-null object
city            188593 non-null object
hours           143791 non-null object
is_open         188593 non-null int64
latitude        188587 non-null float64
longitude       188587 non-null float64
name            188593 non-null object
neighborhood    188593 non-null object
postal_code     188593 non-null object
review_count    188593 non-null int64
stars           188593 non-null float64
state           188593 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 21.6+ MB
Out[892]:
address attributes business_id categories city hours is_open latitude longitude name neighborhood postal_code review_count stars state
0 1314 44 Avenue NE {'BikeParking': 'False', 'BusinessAcceptsCredi... Apn5Q_b6Nz61Tq4XzPdf9A Tours, Breweries, Pizza, Restaurants, Food, Ho... Calgary {'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'... 1 51.091813 -114.031675 Minhas Micro Brewery T2E 6L6 24 4.0 AB
1 {'Alcohol': 'none', 'BikeParking': 'False', 'B... AjEbIBw6ZFfln7ePHha9PA Chicken Wings, Burgers, Caterers, Street Vendo... Henderson {'Friday': '17:0-23:0', 'Saturday': '17:0-23:0... 0 35.960734 -114.939821 CK'S BBQ & Catering 89002 3 4.5 NV
2 1335 rue Beaubien E {'Alcohol': 'beer_and_wine', 'Ambience': '{'ro... O8S5hYJ1SMc8fA4QBtVujA Breakfast & Brunch, Restaurants, French, Sandw... Montréal {'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'... 0 45.540503 -73.599300 La Bastringue Rosemont-La Petite-Patrie H2G 1K7 5 4.0 QC

2. Review data types and attributes

In [893]:
df_review_data_full.info()
df_review_data_full.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5996996 entries, 0 to 5996995
Data columns (total 9 columns):
business_id    object
cool           int64
date           datetime64[ns]
funny          int64
review_id      object
stars          int64
text           object
useful         int64
user_id        object
dtypes: datetime64[ns](1), int64(4), object(4)
memory usage: 411.8+ MB
Out[893]:
business_id cool date funny review_id stars text useful user_id
0 iCQpiavjjPzJ5_3gPD5Ebg 0 2011-02-25 0 x7mDIiDB3jEiPGPHOmDzyw 2 The pizza was okay. Not the best I've had. I p... 0 msQe1u7Z_XuqjGoqhB0J5g
1 pomGBqfbxcqPv14c3XH-ZQ 0 2012-11-13 0 dDl8zu1vWPdKGihJrwQbpw 5 I love this place! My fiance And I go here atl... 0 msQe1u7Z_XuqjGoqhB0J5g
2 jtQARsP6P-LbkyjbO1qNGg 1 2014-10-23 1 LZp4UX5zK3e-c5ZGSeo3kA 1 Terrible. Dry corn bread. Rib tips were all fa... 3 msQe1u7Z_XuqjGoqhB0J5g


Attribute Selection

Reduce the attribute list to only the useful information for this project.

  • From business data, [ business_id, latitude, longitude, city, state, postal_code and categories ]
  • From review data, [ business_id, review_id, user_id ]
In [904]:
start_time = time.time()

business_cols = ['business_id', 'latitude', 'longitude', 'is_open', 'city', 'neighborhood', \
                 'state', 'postal_code', 'stars', 'categories']
review_cols = ['business_id', 'review_id', 'user_id', 'cool', 'funny', 'stars', 'useful']

df_business_data = df_business_data_full.filter(business_cols , axis=1)
df_review_data = df_review_data_full.filter(review_cols , axis=1)

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))

df_business_data.to_pickle('df_business_data.pkl')
df_review_data.to_pickle('df_review_data.pkl')

display(df_business_data.head(3))
display(df_review_data.head(3))
Time taken: 2.76 seconds
business_id latitude longitude is_open city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 1 Calgary AB T2E 6L6 4.0 Tours, Breweries, Pizza, Restaurants, Food, Ho...
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 0 Henderson NV 89002 4.5 Chicken Wings, Burgers, Caterers, Street Vendo...
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 0 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 Breakfast & Brunch, Restaurants, French, Sandw...
business_id review_id user_id cool funny stars useful
0 iCQpiavjjPzJ5_3gPD5Ebg x7mDIiDB3jEiPGPHOmDzyw msQe1u7Z_XuqjGoqhB0J5g 0 0 2 0
1 pomGBqfbxcqPv14c3XH-ZQ dDl8zu1vWPdKGihJrwQbpw msQe1u7Z_XuqjGoqhB0J5g 0 0 5 0
2 jtQARsP6P-LbkyjbO1qNGg LZp4UX5zK3e-c5ZGSeo3kA msQe1u7Z_XuqjGoqhB0J5g 1 1 1 3

Prepare Business Data

For the scope of this project, filter down to the businesses located within US/Canada and the ones that have been categorized. This will eliminate noise and small number of restaurants that have not been categorized. Limiting the data to US/Canada will help fit it within North American Map coordinates while retaining majority of the data.

Note: Business categories could be inferred based on user reviews, however, that is outside the scope of this project

In [905]:
start_time = time.time()

print(df_business_data.shape)

north_american_state_provinces = ['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', \
                                  'DE', 'FL', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', \
                                  'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', \
                                  'MP', 'MS', 'MT', 'NA', 'NC', 'ND', 'NE', 'NH', 'NJ', \
                                  'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', \
                                  'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', \
                                  'WI', 'WV', 'WY','AB', 'BC', 'MB', 'NB', 'NL', 'NT', \
                                  'NS', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT']

for idx, row in df_business_data.iterrows():
    if row['state'] not in north_american_state_provinces:
        df_business_data.drop(idx, inplace=True)

df_business_data = df_business_data[df_business_data['categories'].notnull()]


df_business_data.to_pickle('df_business_data.pkl')

print(df_business_data.shape)

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
df_business_data.head()
(188593, 10)
(187464, 10)
Time taken: 48.75 seconds
Out[905]:
business_id latitude longitude is_open city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 1 Calgary AB T2E 6L6 4.0 Tours, Breweries, Pizza, Restaurants, Food, Ho...
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 0 Henderson NV 89002 4.5 Chicken Wings, Burgers, Caterers, Street Vendo...
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 0 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 Breakfast & Brunch, Restaurants, French, Sandw...
3 bFzdJJ3wp3PZssNEsyU23g 33.449999 -112.076979 1 Phoenix AZ 85003 1.5 Insurance, Financial Services
4 8USyCYqpScwiNEb58Bt6CA 51.035591 -114.027366 1 Calgary AB T2H 0N5 2.0 Home & Garden, Nurseries & Gardening, Shopping...

Filter down to the list of businesses that are categorized as Restaurants

  • Cleanup and build list of all categories
  • Filter down to rows containing Restaurants as category
  • Display number of rows before and after
In [906]:
start_time = time.time()

# create copy so that original business_data is intact
df_business_categorized_data = df_business_data.copy()

df_business_categorized_data['categories'] = df_business_data['categories'] \
.map(lambda x : (list(map(str.strip, x.split(',')))))

print('Total data rows and columns:{}'.format(df_business_categorized_data.shape))

df_restaurants = df_business_categorized_data[df_business_categorized_data['categories'] \
                                              .map(lambda x : 'Restaurants' in x)]

print('Restaurant data rows and columns:{}'.format(df_restaurants.shape))

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Total data rows and columns:(187464, 10)
Restaurant data rows and columns:(56839, 10)
Time taken: 1.87 seconds

Since we have dropped all rows that don't have Restaurant as category, dataframe must be re-indexed to fill the gaps.

In [162]:
df_restaurants = df_restaurants.reset_index(drop=True)
df_restaurants.head()
Out[162]:
business_id latitude longitude is_open city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 1 Calgary AB T2E 6L6 4.0 [Tours, Breweries, Pizza, Restaurants, Food, H...
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 0 Henderson NV 89002 4.5 [Chicken Wings, Burgers, Caterers, Street Vend...
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 0 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Breakfast & Brunch, Restaurants, French, Sand...
3 6OuOZAok8ikONMS_T3EzXg 43.712946 -79.632763 1 Mississauga Ridgewood ON L4T 1A8 2.0 [Restaurants, Thai]
4 8-NRKkPY1UiFXW20WXKiXg 33.448106 -112.341302 1 Avondale AZ 85323 2.5 [Mexican, Restaurants]

Display list of unique states. We will use one of these states to cluster at state level to make sense of the clustered data

In [163]:
# list of states included in the dataset
df_restaurants.state.unique()
Out[163]:
array(['AB', 'NV', 'QC', 'ON', 'AZ', 'OH', 'IL', 'WI', 'PA', 'NC', 'SC',
       'IN', 'CO', 'VA', 'NY', 'OR', 'CA', 'MO', 'FL', 'BC'], dtype=object)

Top 20 Categories: Combine categories into a list and count top 50. Top 20 categories that represent actual food categories will be used for analysis.

In [164]:
start_time = time.time()

all_categories = df_restaurants['categories'].sum()

ct = Counter(all_categories)

top_50_categories = [x[0] for x in list(ct.most_common(50))]

print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))

print(top_50_categories)
Time taken: 224.37 seconds
['Restaurants', 'Food', 'Nightlife', 'Bars', 'Sandwiches', 'Fast Food', 'American (Traditional)', 'Pizza', 'Burgers', 'Breakfast & Brunch', 'Italian', 'Mexican', 'Chinese', 'American (New)', 'Coffee & Tea', 'Cafes', 'Japanese', 'Chicken Wings', 'Seafood', 'Salad', 'Event Planning & Services', 'Sushi Bars', 'Delis', 'Canadian (New)', 'Asian Fusion', 'Mediterranean', 'Barbeque', 'Sports Bars', 'Specialty Food', 'Caterers', 'Steakhouses', 'Desserts', 'Bakeries', 'Indian', 'Thai', 'Pubs', 'Diners', 'Vietnamese', 'Middle Eastern', 'Vegetarian', 'Greek', 'French', 'Wine Bars', 'Cocktail Bars', 'Korean', 'Ice Cream & Frozen Yogurt', 'Beer', 'Wine & Spirits', 'Buffets', 'Arts & Entertainment']

Since we are calculating demand for specific categories of food. To limit the scope of this project, we choose top 20 specific categories of foods that the businesses belong to:

  1. Sandwiches
  2. American (Traditional)
  3. Pizza
  4. Burgers
  5. Italian
  6. Mexican
  7. Chinese
  8. American (New)
  9. Japanese
  10. Chicken Wings
  11. Seafood
  12. Sushi Bars
  13. Canadian (New)
  14. Asian Fusion
  15. Mediterranean
  16. Steakhouses
  17. Indian
  18. Thai
  19. Vietnamese
  20. Middle Eastern
In [907]:
top_20_specific_categories = ['Sandwiches', 'American (Traditional)', 'Pizza', \
                              'Burgers', 'Italian', 'Mexican', 'Chinese', \
                              'American (New)', 'Japanese', 'Chicken Wings', \
                              'Seafood', 'Sushi Bars', 'Canadian (New)', \
                              'Asian Fusion', 'Mediterranean', 'Steakhouses', \
                              'Indian', 'Thai', 'Vietnamese', 'Middle Eastern']
# save top 20 categories for later sections
pd.DataFrame(top_20_specific_categories, columns=['categories']).to_pickle('df_top_20_specific_categories.pkl')
len(top_20_specific_categories)
Out[907]:
20

Category Reduction: Reduce categories of each business to only include top 20 categories. All categories other than the top 20 selected above are removed for optimization since they are not useful for the purpose of this analysis.

In [908]:
for idx, row in df_restaurants.iterrows():
    categories = row['categories']
    new_categories = list(set(categories) & set(top_20_specific_categories))
    df_restaurants.at[idx, 'categories'] = new_categories

# remove restaurants that don't have one of these categories
df_restaurants = df_restaurants[df_restaurants['categories'].astype(str) != '[]']
print(len(df_restaurants))
df_restaurants.head()
45344
Out[908]:
business_id latitude longitude is_open city neighborhood state postal_code stars categories
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 1 Calgary AB T2E 6L6 4.0 [Pizza]
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 0 Henderson NV 89002 4.5 [Chicken Wings, Burgers]
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 0 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches]
7 6OuOZAok8ikONMS_T3EzXg 43.712946 -79.632763 1 Mississauga Ridgewood ON L4T 1A8 2.0 [Thai]
8 8-NRKkPY1UiFXW20WXKiXg 33.448106 -112.341302 1 Avondale AZ 85323 2.5 [Mexican]

Getting Dummies Create one column per category within the dataframe with value 1 if that category applies to the business, 0 otherwise. It uses similar approach to Get Dummies which is often used in pandas for optmization.

In [909]:
df_category_flags = pd.DataFrame(0, index=np.arange(len(df_restaurants)), \
                                 columns=top_20_specific_categories)

for index, row in df_restaurants.iterrows():
    for category in row['categories']:
        df_category_flags.at[index, category] = 1

restaurant_category_count = pd.DataFrame(df_category_flags.sum(), columns=['Count'])


display(restaurant_category_count)
Count
Sandwiches 6910.0
American (Traditional) 6649.0
Pizza 6578.0
Burgers 5114.0
Italian 4503.0
Mexican 4412.0
Chinese 4235.0
American (New) 4229.0
Japanese 2565.0
Chicken Wings 2537.0
Seafood 2356.0
Sushi Bars 2153.0
Canadian (New) 1828.0
Asian Fusion 1775.0
Mediterranean 1741.0
Steakhouses 1522.0
Indian 1409.0
Thai 1388.0
Vietnamese 1225.0
Middle Eastern 1182.0

Replace category list for each restaurant with binary flag for each category within restaurants data

In [177]:
df_restaurants_flagged = df_restaurants.join(df_category_flags)
print(len(df_restaurants_flagged))
df_restaurants_flagged.head()
45344
Out[177]:
business_id latitude longitude is_open city neighborhood state postal_code stars categories ... Seafood Sushi Bars Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern
0 Apn5Q_b6Nz61Tq4XzPdf9A 51.091813 -114.031675 1 Calgary AB T2E 6L6 4.0 [Pizza] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 AjEbIBw6ZFfln7ePHha9PA 35.960734 -114.939821 0 Henderson NV 89002 4.5 [Chicken Wings, Burgers] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 0 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 6OuOZAok8ikONMS_T3EzXg 43.712946 -79.632763 1 Mississauga Ridgewood ON L4T 1A8 2.0 [Thai] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 8-NRKkPY1UiFXW20WXKiXg 33.448106 -112.341302 1 Avondale AZ 85323 2.5 [Mexican] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 30 columns

Save flagged restaurants to easily load for analysis later

In [910]:
df_restaurants_flagged.to_pickle('df_restaurants_flagged.pkl')
In [911]:
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_supply_indicator_by_category = df_restaurants_flagged.filter(top_20_specific_categories).sum()
df_supply_indicator_by_category.to_pickle('df_supply_indicator_by_category.pkl')
display(df_supply_indicator_by_category.to_frame('Supply (Restaurant Count)'))
Supply (Restaurant Count)
Sandwiches 6910.0
American (Traditional) 6649.0
Pizza 6578.0
Burgers 5114.0
Italian 4503.0
Mexican 4412.0
Chinese 4235.0
American (New) 4229.0
Japanese 2565.0
Chicken Wings 2537.0
Seafood 2356.0
Sushi Bars 2153.0
Canadian (New) 1828.0
Asian Fusion 1775.0
Mediterranean 1741.0
Steakhouses 1522.0
Indian 1409.0
Thai 1388.0
Vietnamese 1225.0
Middle Eastern 1182.0


Data Clustering

We need to set parameters for clustering restaurants using DBSCAN algorithm:

Define parameters for DB SCAN clustering algorithm

  1. epsilon: [ 100 meters ] We are setting 100 meters as the distance limit for a neghboring business to be included within a particular cluster. It means that, as long as, there are businesses within 100 meters of each other, they will keep getting included within the same cluster.
  2. min_neighbors: [ 4 ] Least number of businesses within 100 meters of one another to declare them a cluster. We will eliminate clusters with less number of businesses than min_neighbors threshold to reduce noise.

Define parameters for DB SCAN clustering algorithm

  1. epsilon: [ 100 meters ] We are setting 100 meters as the distance limit for a neghboring business to be included within a particular cluster. It means that, as long as, there are businesses within 100 meters of each other, they will keep getting included within the same cluster.
  2. min_neighbors: [ 4 ] Least number of businesses within 100 meters of one another to declare them a cluster. We will eliminate clusters with less number of businesses than min_neighbors threshold to reduce noise.
In [912]:
kms_per_radian = 6371.0088
epsilon = 0.5 / kms_per_radian
min_neighbors = 4
In [913]:
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_population_size_compare = pd.DataFrame(0, index=range(0,255), \
                                          columns=['Minimum Neighbors','Epsilon(m)','Coverage','Count'])
In [184]:
start_mn = 3
end_mn = 20
start_eps = 50
end_eps = 1500

start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
    for e in range(start_eps, end_eps+50, 50):
        eps = e/1000/kms_per_radian
        dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
        .fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
        cluster_coverage = sum(dbscn.labels_ >= mn)
        cluster_count = sum(np.unique(dbscn.labels_) >= mn)
        df_population_size_compare.at[indx, 'Minimum Neighbors'] = mn
        df_population_size_compare.at[indx, 'Epsilon(m)'] = e
        df_population_size_compare.at[indx, 'Coverage'] = cluster_coverage
        df_population_size_compare.at[indx, 'Count'] = cluster_count
        indx = indx + 1
        print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))

df_population_size_compare.head()
Out[184]:
Minimum Neighbors Epsilon(m) Coverage Count Compression
0 3.0 50.0 20745.0 3810.0 81.634129
1 3.0 100.0 31235.0 4126.0 86.790459
2 3.0 150.0 36244.0 3706.0 89.774859
3 3.0 200.0 38598.0 3291.0 91.473651
4 3.0 250.0 39873.0 3000.0 92.476112

Cluster data using DB SCAN algorithm

In [ ]:
df_population_size_compare.to_pickle('df_population_size_compare.pkl')
In [962]:
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')
df_population_size_compare['Compression'] = 100 * (1 - df_population_size_compare['Count']/df_population_size_compare['Coverage'])
df_population_size_compare.head()
Out[962]:
Minimum Neighbors Epsilon(m) Coverage Count Compression
0 3.0 50.0 20745.0 3810.0 81.634129
1 3.0 100.0 31235.0 4126.0 86.790459
2 3.0 150.0 36244.0 3706.0 89.774859
3 3.0 200.0 38598.0 3291.0 91.473651
4 3.0 250.0 39873.0 3000.0 92.476112
In [970]:
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')

x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Coverage'

x_start = 3
x_end = 20
max_x = []
max_y = []
max_z = []

max_2x = []
max_2y = []
max_2z = []

for i in range(x_start, x_end+1):
    # figure out the peak values line
    df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
    max_row = df[df[z_col] == df[z_col].max()]
    max_x.append(max_row[x_col].values[0])
    max_y.append(max_row[y_col].values[0])
    max_z.append(max_row[z_col].values[0])
    
    # find the second peak line
    df = df_population_size_compare[df_population_size_compare[x_col]==i].reset_index(drop=True)
    peak2df = df[df[y_col] <= 300]
    max_2row = peak2df[peak2df[z_col] == peak2df[z_col].max()]
    max_2x.append(max_2row[x_col].values[0])
    max_2y.append(max_2row[y_col].values[0])
    max_2z.append(max_2row[z_col].values[0])
    
x = df_population_size_compare[x_col].values
y = df_population_size_compare[y_col].values
z = df_population_size_compare[z_col].values
    
traces = []
traces.append(go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=6,
        color=z,
        colorscale='Jet',   
        opacity=0.8
    ),
    showlegend=True, 
    name='Coverage'
))
# draw max line for z values
traces.append(go.Scatter3d(
    z=max_z,
    y=max_y,
    x=max_x,
    line=dict(
        color='teal',
        width = 4
    ),
    mode='lines',
    name='Max Counts Line'
))
# draw 2nd peak line for z values
traces.append(go.Scatter3d(
    z=max_2z,
    y=max_2y,
    x=max_2x,
    line=dict(
        color='purple',
        width = 4
    ),
    mode='lines',
    name='2nd Max Counts Line'
))


layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=50,
        t=50
    ),
    
    paper_bgcolor='#999999',
    title='Clustered Points Coverage vs. Minimum Neighbors & Distance (meters)',
    scene=dict(
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=-.25),
            eye=dict(x=1.25, y=1.25, z=1.25)
        ),
        xaxis=dict( title= x_col),
        yaxis=dict( title= y_col),
        zaxis=dict( title= z_col) 
    ),
    font= dict(color='#ffffff')
)

fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_coverage">DBSCAN Min. Neighbors & Distance vs Coverage</a>'))
iplot(fig, filename='clusters-scatter')

From the 3D Scatter Heat Plot above, we can observe that the cluster Coverage (total number of points clustered) is inversly proportional to Minimum Neighbors count where it is maximized at mn=x=3, whereas, it has 2 peaks on Maximum Distance Epsilon(e) axis. First peak is at 550 meters, second peak is between values of 50 to 350 for values of Minimum Neighbors less than 6.

To further narrow down to ideal parameters, we will look at Ribbon Plot for values of Minimum Neighbors X-axies and Maximum Distance Epsilon Y-axis against number of clusters that resulted from clusterings Count. The aim is to narrow down to a range where cluster count is maximized.

In [971]:
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')

x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Count'


x_start = 3
x_end = 20
y_start = 0
y_end = 30
traces = []
max_x = []
max_y = []
max_z = []
for i in range(x_start, x_end+1):
    x = []
    y = []
    z = []
    ci = int(255/18*i) # "color index"
    df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
    max_row = df[df[z_col] == df[z_col].max()]
    max_x.append(max_row[x_col].values[0])
    max_y.append(max_row[y_col].values[0])
    max_z.append(max_row[z_col].values[0])
    max_x.append(max_row[x_col].values[0] + 0.5)
    max_y.append(max_row[y_col].values[0])
    max_z.append(max_row[z_col].values[0])
    
    for j in range(y_start, y_end):
        x.append([i, i+.5])
        y.append([df.loc[j,y_col], df.loc[j,y_col]])
        z.append([df.loc[j,z_col], df.loc[j,z_col]])
    traces.append(dict(
        z=z,
        x=x,
        y=y,
        colorscale=[ [i, 'rgb(255,%d,%d)'%(ci, ci)] for i in np.arange(0,1.1,0.1) ],
        showscale=False,
        type='surface'
    ))
# draw max line for z values
traces.append(go.Scatter3d(
    z=max_z,
    y=max_y,
    x=max_x,
    line=dict(
        color='green',
        width = 8
    ),
    mode='lines',
    name='Max Counts Line'
))


layout = go.Layout(
    autosize=True,
    height=500,
    margin=go.layout.Margin(
        l=0,
        r=0,
        b=0,
        t=50,
        pad=0
    ),
    paper_bgcolor='#999999',
    title='Clustered Ribbons of Cluster Count vs. on Minimum Neighbors & Distance (meters)',
    scene=dict(
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=-.25),
            eye=dict(x=1.5, y=1.5, z=1.5)
        ),
        xaxis=dict( title= x_col),
        yaxis=dict( title= y_col),
        zaxis=dict( title= z_col) 
    ),
    font= dict(color='#ffffff')
)
fig = { 'data':traces, 'layout': layout }
display(HTML('<a id="mn_e_count">DBSCAN Min. Neighbors & Distance vs Cluster Count</a>'))
iplot(fig, filename='ribbon-plot-python')

The Ribbon Chart above shows that number of clusters grows inversly proprotional to number of Minimum Neighbors. It peaks around mn = 3. For Maximum Distance to include locations within a cluster Epsilon, cluster count peaks between values of 50 to 350.

We can observe that there is a convergence from both graphs (Coverage & Count) for ranges:

Minimum Neighbors : 3 - 6
Epsilon(e) : 50 - 350 meters

We will investigate only these ranges from here onwards.

In [120]:
df_population_dist_compare = pd.DataFrame(None, index=range(0,28), \
                                          columns=['Minimum Neighbors','Epsilon(m)','Min','Max', 'Labels'])
In [122]:
start_mn = 3
end_mn = 6
start_eps = 50
end_eps = 350

start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
    for e in range(start_eps, end_eps+50, 50):
        eps = e/1000/kms_per_radian
        dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
        .fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
        
        df = pd.DataFrame(dbscn.labels_, columns=['label'])
        
        df_counts = df.groupby(['label']).size().reset_index(name='count')
        df_counts = df_counts[(df_counts['label'] > -1) & (df_counts['count'] >= mn)]
        
        labels = [x for x in dbscn.labels_ if x != -1] # all labels except -1
        
        df_population_dist_compare.at[indx, 'Minimum Neighbors'] = mn
        df_population_dist_compare.at[indx, 'Epsilon(m)'] = e
        df_population_dist_compare.at[indx, 'Min'] = df_counts['count'].min()
        df_population_dist_compare.at[indx, 'Max'] = df_counts['count'].max()
        df_population_dist_compare.at[indx, 'Labels'] = labels
        
        indx = indx + 1
        print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))

df_population_dist_compare.head()
Completed mn:6 e:350 in 46.42 seconds
Out[122]:
Minimum Neighbors Epsilon(m) Min Max Labels
0 3 50 3 109 [0, 1634, 1, 2, 3, 4, 5, 2, 6, 7, 8, 9, 3336, ...
1 3 100 3 1131 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 6, 10, 11, 1...
2 3 150 3 2000 [0, 1, 2, 3, 4, 5, 4, 6, 7, 8, 9, 9, 6, 10, 11...
3 3 200 3 2190 [0, 1, 2, 3, 4, 5, 6, 5, 7, 8, 200, 9, 10, 11,...
4 3 250 3 2539 [0, 1, 2, 3, 4, 5, 6, 7, 6, 8, 9, 10, 11, 12, ...
In [123]:
df_population_dist_compare.to_pickle('df_population_dist_compare.pkl')
In [897]:
df_population_dist_compare = pd.read_pickle('df_population_dist_compare.pkl')

x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Max'
x = df_population_dist_compare[x_col].values
y = df_population_dist_compare[y_col].values
z = df_population_dist_compare[z_col].values
zmin = df_population_dist_compare[z_col].min()
zmax = df_population_dist_compare[z_col].max()
intensity = (df_population_dist_compare[z_col].values - zmin)/(zmax-zmin)

traces = []
traces.append(
    go.Mesh3d(
        x = x,
        y = y,
        z = z,
        intensity = z,
        opacity=0.6,
        colorscale = 'Earth',
        reversescale=True
    )
)

layout = go.Layout(
    title='Largest Cluster vs. Min Neighbors and Epsilon',
    paper_bgcolor='#999999',
    scene = dict(
        camera = dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=-.25),
            eye=dict(x=-2, y=-.8, z=0.3)
        ),
        xaxis=dict( title= x_col),
        yaxis=dict( title= y_col),
        zaxis=dict( title= z_col) 
    ),
    font= dict(color='#ffffff')
)
fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_largest_cluster">DBSCAN Min. Neighbors & Distance vs Largest Cluster Size</a>'))
iplot(fig, filename='max-3d-mesh')
In [972]:
df_population_dist_compare = pd.read_pickle('df_population_dist_compare.pkl')

numCols = 4
fig = tools.make_subplots(rows=7, cols=4)

idx = 0
for index, row in df_population_dist_compare.iterrows():

    trace = go.Histogram(
        x = row['Labels'],
        name = "mn:{}<br>e:{}" \
        .format(row['Minimum Neighbors'], row['Epsilon(m)'])
    )
    i,j = idx // numCols + 1, idx % numCols + 1
    fig.append_trace(trace, i, j)
    idx = idx + 1
    fig['layout']['xaxis' + str(idx)]['tickformat'] = 's'
    fig['layout']['yaxis' + str(idx)]['tickformat'] = 's'
fig['layout']['paper_bgcolor'] = '#999999'
fig['layout']['font']['color'] = '#ffffff'
fig['layout']['font']['size'] = 9
fig['layout']['xaxis']['tickformat'] = 's'
fig['layout']['yaxis' + str(idx)]['tickformat'] = 's'
display(HTML('<a id="mn_e_histograms">DBSCAN Label Count Histogram for Min. Neighbors & Distance</a>'))
iplot(fig, filename='binning function')
This is the format of your plot grid:
[ (1,1) x1,y1 ]    [ (1,2) x2,y2 ]    [ (1,3) x3,y3 ]    [ (1,4) x4,y4 ]  
[ (2,1) x5,y5 ]    [ (2,2) x6,y6 ]    [ (2,3) x7,y7 ]    [ (2,4) x8,y8 ]  
[ (3,1) x9,y9 ]    [ (3,2) x10,y10 ]  [ (3,3) x11,y11 ]  [ (3,4) x12,y12 ]
[ (4,1) x13,y13 ]  [ (4,2) x14,y14 ]  [ (4,3) x15,y15 ]  [ (4,4) x16,y16 ]
[ (5,1) x17,y17 ]  [ (5,2) x18,y18 ]  [ (5,3) x19,y19 ]  [ (5,4) x20,y20 ]
[ (6,1) x21,y21 ]  [ (6,2) x22,y22 ]  [ (6,3) x23,y23 ]  [ (6,4) x24,y24 ]
[ (7,1) x25,y25 ]  [ (7,2) x26,y26 ]  [ (7,3) x27,y27 ]  [ (7,4) x28,y28 ]

Based on the histograms drawn above, Teal hisgtogram with Minimum Neighbor distance of 100 meters and Epsilon(e) value of 4 would be our parameters of choice due to the following reasons:

  1. In Cluster Count Ribbon Graph, it is on the maximum curve. It will provide highest number of clusters for minimum 4 neighbors
  2. In Coverage Scatter Graph, it is well above mn=100 and e=5 and rest of the value, only below outliers (which would potentially include noise)
  3. It is in the lower (earth) range of the surface graph which indicates that maximum count of businesses in a cluster will be minimized.
  4. Its histogram is least skewed for e=4 values, which means that its clusters would be more evenly distributed than higher e value.
  5. We will not be selecting e=3 even though it has most evenly distributed histograms because it will not optimize number of cluster.

Define parameters for DB SCAN clustering algorithm

  1. epsilon: [ 100 meters ] We are setting 100 meters as the distance limit for a neghboring business to be included within a particular cluster. It means that, as long as, there are businesses within 100 meters of each other, they will keep getting included within the same cluster.
  2. min_neighbors: [ 4 ] Least number of businesses within 100 meters of one another to declare them a cluster. We will eliminate clusters with less number of businesses than min_neighbors threshold to reduce noise.
In [914]:
epsilon = 0.1 / kms_per_radian
min_neighbors = 4
In [915]:
start_time = time.time()

df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')

dbscn = DBSCAN(eps=epsilon, min_samples=min_neighbors, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))

cluster_labels = dbscn.labels_

print(dbscn)

num_clusters = len(set(cluster_labels))

message = ' Total points clustered: {:,} \n Number of clusters: {:,} \n Compression ratio: {:.1f}% \n Time taken: {:,.2f} seconds'
print(message.format(len(df_restaurants_flagged), num_clusters, \
                     100*(1 - float(num_clusters) / len(df_restaurants_flagged)), time.time()-start_time))

fd_cluster_labels = pd.DataFrame(cluster_labels, columns=['label'])
print('Number of labels:{}'.format(len(cluster_labels)))
fd_cluster_labels.to_pickle('fd_cluster_labels.pkl')
fd_cluster_labels.head()

# Join cluster labels with the original dataset of the restaurants
df_restaurants_labeled = df_restaurants_flagged.join(pd.DataFrame(fd_cluster_labels))

# Filter out clusters that do not qualify requirements of minimum neighbors
df_rst_lbl_grouped = df_restaurants_labeled.groupby(['label']).size().reset_index(name='count')
df_lbl_counts = df_rst_lbl_grouped[(df_rst_lbl_grouped['label'] > -1) \
                                   & (df_rst_lbl_grouped['count'] >= min_neighbors)].set_index('label')

# Remove all restaurants that were not labeled
df_restaurants_label_filtered = df_restaurants_labeled.join(df_lbl_counts, on='label', how='inner')

df_restaurants_labeled.to_pickle('df_restaurants_labeled.pkl')

print(len(df_restaurants_label_filtered))

df_restaurants_label_filtered.to_pickle('df_restaurants_label_filtered.pkl')
df_restaurants_label_filtered.head()
DBSCAN(algorithm='ball_tree', eps=1.5696101377226163e-05, leaf_size=30,
    metric='haversine', metric_params=None, min_samples=4, n_jobs=1,
    p=None)
 Total points clustered: 45,344 
 Number of clusters: 2,738 
 Compression ratio: 94.0% 
 Time taken: 2.30 seconds
Number of labels:45344
19388
Out[915]:
business_id latitude longitude is_open city neighborhood state postal_code stars categories ... Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern label count
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 0 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6
5106 ypIzqbkJli_75hgFz98WiQ 33.478759 -111.925644 1 Scottsdale AZ 85257 3.0 [Sandwiches] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6
15140 X20bnlwr15SraBzvP7vC4g 36.061235 -115.289685 1 Las Vegas Spring Valley NV 89148 5.0 [American (Traditional), American (New)] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6
23370 qnZzSC4TKen19Gz9nyKCvw 35.365496 -80.712032 1 Concord NC 28027 3.0 [American (Traditional), Seafood, Steakhouses] ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 6
24981 fId--RMAhHZcJagMSD-aUw 41.509884 -81.603132 0 Cleveland OH 44106 3.5 [Sushi Bars, Sandwiches, American (New)] ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6

5 rows × 32 columns



User Data Preparation

Filter reviews data to only include filtered restaurants reviews

In [917]:
df_reviews_and_restaurants = df_review_data.join(df_restaurants_label_filtered.set_index('business_id'), \
                                                 on='business_id', how='inner', lsuffix='Review ')
print(len(df_reviews_and_restaurants))
1270731
In [918]:
# import data
df_bus_reviews = df_reviews_and_restaurants.set_index('business_id')
df_review_data = pd.read_pickle('df_review_data.pkl')
df_restaurants_label_filtered = pd.read_pickle('df_restaurants_label_filtered.pkl')
top_20_specific_categories = pd.read_pickle('df_top_20_specific_categories.pkl')['categories'].values

Group each user's review for each category restaurants. Higher the count of reviews for a certain category, more the user is likely to visit that category of restaurant.

In [919]:
df_user_categories_only = df_reviews_and_restaurants[np.append(top_20_specific_categories, "user_id")]
df_user_rst_visits = df_user_categories_only.groupby(['user_id']).sum()
df_user_rst_visits.to_pickle('df_user_rst_visits.pkl')
df_user_rst_visits.head()
Out[919]:
Sandwiches American (Traditional) Pizza Burgers Italian Mexican Chinese American (New) Japanese Chicken Wings Seafood Sushi Bars Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern
user_id
---1lKK3aKOuomHnwAkAow 5.0 4.0 5.0 2.0 4.0 2.0 0.0 8.0 1.0 0.0 2.0 1.0 0.0 2.0 0.0 2.0 2.0 1.0 0.0 0.0
---94vtJ_5o_nikEs6hUjg 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
---udAKDsn0yQXmzbWQNSw 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
--0sXNBv6IizZXuV-nl0Aw 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
--1mPJZdSY9KluaBYAGboQ 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Restaurant/Review Count Ratio: The more the users are reviewing a particular category restaurants, the more they are interested in eating that particular kind of food. Thus overall review count of a restaurant category indicates the interest of users in that category of food and restaurants.

Overall in entire population, equilibrium should exist between review count indicating desire (let's call it Demand Indicator) of a particular restaurant's food type and number of restaurants reviewed of that category that cater to that demand (Supply Indicator).

We can calculate the ratio of the number of restaurants to the number of reviews of each category to find out the ratio by which user interest translates into restaurant count of that category in overall population.

In [920]:
df_demand_indicator_by_category = df_user_rst_visits.sum()
df_demand_indicator_by_category.to_frame('Demand (Review Count)')
Out[920]:
Demand (Review Count)
Sandwiches 167175.0
American (Traditional) 239114.0
Pizza 145039.0
Burgers 143517.0
Italian 138680.0
Mexican 142422.0
Chinese 92490.0
American (New) 235117.0
Japanese 107249.0
Chicken Wings 42579.0
Seafood 116319.0
Sushi Bars 93285.0
Canadian (New) 31760.0
Asian Fusion 87189.0
Mediterranean 51720.0
Steakhouses 77547.0
Indian 28051.0
Thai 51205.0
Vietnamese 34256.0
Middle Eastern 23580.0
In [921]:
review_restaurant_ratio = df_supply_indicator_by_category/df_demand_indicator_by_category
df_restaurant_review_ratio = review_restaurant_ratio.to_frame('Supply/Demand (Restaurant/Review) Ratio')
df_restaurant_review_ratio
Out[921]:
Supply/Demand (Restaurant/Review) Ratio
Sandwiches 0.041334
American (Traditional) 0.027807
Pizza 0.045353
Burgers 0.035633
Italian 0.032470
Mexican 0.030978
Chinese 0.045789
American (New) 0.017987
Japanese 0.023916
Chicken Wings 0.059583
Seafood 0.020255
Sushi Bars 0.023080
Canadian (New) 0.057557
Asian Fusion 0.020358
Mediterranean 0.033662
Steakhouses 0.019627
Indian 0.050230
Thai 0.027107
Vietnamese 0.035760
Middle Eastern 0.050127

Save supply/demand ratio indicator for each category of the restaurant

In [683]:
gb = df_restaurants_label_filtered.groupby(['label'])

__bar = __progressbar(len(gb))

df_clust_group_info = pd.DataFrame({'size': gb.size()})
df_bus_reviews = df_reviews_and_restaurants.set_index('business_id')
df_restaurant_review_ratio_tps = df_restaurant_review_ratio.transpose()

start_time = time.time()

def get_group_info(cur_cluster):
    groupSize = len(cur_cluster)
    df_clust_group_info.at[cur_cluster.name, 'size'] = groupSize
    df_clust_group_info.at[cur_cluster.name, 'latitude'] = cur_cluster['latitude'].sum()/groupSize
    df_clust_group_info.at[cur_cluster.name, 'longitude'] = cur_cluster['longitude'].sum()/groupSize
    df_clust_group_info.at[cur_cluster.name, 'city'] = pd.Series(cur_cluster['city'].unique()).str.cat(sep=', ')
    df_clust_group_info.at[cur_cluster.name, 'zip'] = pd.Series(cur_cluster['postal_code'].unique()).str.cat(sep=', ')
    df_clust_group_info.at[cur_cluster.name, 'neighborhood'] = pd.Series(cur_cluster['neighborhood'].unique()).str.cat(sep=', ')
    df_cur_cluster_reviews = cur_cluster[['business_id']].join(df_bus_reviews, on='business_id', how='inner')
    df_cur_cluster_unique_users = df_cur_cluster_reviews[['user_id']].drop_duplicates()
    df_clust_user_rst_visits = df_cur_cluster_unique_users.join(df_user_rst_visits, on='user_id')
    df_clust_group_info.at[cur_cluster.name, 'reviews_count'] = len(df_cur_cluster_reviews)
    df_clust_group_info.at[cur_cluster.name, 'user_count'] = len(df_cur_cluster_unique_users)
    df_clust_group_info.at[cur_cluster.name, 'total_stars'] = cur_cluster['stars'].sum()
    df_clust_group_info.at[cur_cluster.name, 'total_open'] = cur_cluster['is_open'].sum()

    for category in top_20_specific_categories:
        df_clust_group_info.at[cur_cluster.name, category + ' Supply'] =  cur_cluster[category].sum()
        df_clust_group_info.at[cur_cluster.name, category + ' Demand'] =  df_clust_user_rst_visits[category].sum() \
        * df_restaurant_review_ratio_tps.loc['Supply/Demand (Restaurant/Review) Ratio',category]
        df_clust_group_info.at[cur_cluster.name, category + ' Demand'] =  df_clust_user_rst_visits[category].sum() \
        * df_restaurant_review_ratio_tps.loc['Supply/Demand (Restaurant/Review) Ratio',category]

    __bar.value += 1

gb.apply(get_group_info)
df_clust_group_info.head().transpose()
Out[683]:
label 0.0 1.0 2.0 3.0 5.0
size 6 4 84 134 916
latitude 37.8184 38.0476 40.2546 38.9435 40.0187
longitude -90.6483 -96.3244 -91.6113 -93.6417 -92.8236
city Montréal, Scottsdale, Las Vegas, Concord, Clev... Markham, Scottsdale, Phoenix, Valley City Phoenix, Cleveland, Guadalupe, Las Vegas, Toro... Calgary, Toronto, Chagrin Falls, Avondale, Oak... Monticello, Toronto, Phoenix, Fairlawn, Montré...
zip H2G 1K7, 85257, 89148, 28027, 44106, 28173 L3R 1M5, 85257, 85053, 44280 85007, 44135, 85283, 89119, M1T 3K8, 15234, 85... T3C 1S2, M6J 1V9, M5R 3G6, 44023, T2E 4X4, 853... 61856, M9W 0B5, 85001, 44333, 85032, H2Z 1B9, ...
neighborhood Rosemont-La Petite-Patrie, , Spring Valley Unionville, , Riverside, Southeast, Scarborough, Etobicoke... , Trinity Bellwoods, Seaton Village, Ville-Mar... , Etobicoke, Ville-Marie, Scarborough, Mercier...
reviews_count 278 97 5983 8916 60336
user_count 278 97 5896 8594 50711
total_stars 21 11.5 284.5 450.5 3155.5
total_open 4 2 64 98 678
Sandwiches Supply 3 0 10 24 129
Sandwiches Demand 15.0456 6.11742 367.459 543.169 1956.29
American (Traditional) Supply 2 2 9 23 134
American (Traditional) Demand 17.1568 8.62011 328.371 454.502 1859
Pizza Supply 1 0 12 14 131
Pizza Demand 7.34724 6.03199 326.045 471.629 1842.03
Burgers Supply 0 1 11 16 97
Burgers Demand 6.77035 6.77035 301.815 343.791 1384.64
Italian Supply 0 0 15 12 100
Italian Demand 5.58491 3.53928 239.664 335.907 1260.7
Mexican Supply 0 0 9 20 82
Mexican Demand 4.89458 3.50055 223.509 332.708 1223.61
Chinese Supply 0 0 5 11 80
Chinese Demand 4.12099 6.41042 250.831 341.676 1230.85
American (New) Supply 2 0 4 17 78
American (New) Demand 10.1805 3.2736 209.006 328.511 1157.14
Japanese Supply 0 1 5 3 56
Japanese Demand 2.82212 2.27205 139.217 158.972 783.498
Chicken Wings Supply 0 0 6 6 50
Chicken Wings Demand 4.23042 3.45584 136.386 171.302 691.405
Seafood Supply 1 0 3 7 38
Seafood Demand 6.21818 1.39757 116.626 177.026 659.674
Sushi Bars Supply 1 0 2 0 40
Sushi Bars Demand 4.89292 1.43095 114.43 132.086 596.798
Canadian (New) Supply 0 0 5 1 48
Canadian (New) Demand 0.460453 1.26625 121.214 119.948 645.268
Asian Fusion Supply 0 0 4 6 36
Asian Fusion Demand 2.62619 1.38435 106.269 138.272 535.947
Mediterranean Supply 0 0 5 2 46
Mediterranean Demand 2.28902 1.11085 91.3587 136.129 477.26
Steakhouses Supply 1 0 3 7 28
Steakhouses Demand 5.00484 1.05985 69.7144 89.5375 373.812
Indian Supply 0 0 5 3 31
Indian Demand 1.70782 0.904139 71.6279 86.3453 409.977
Thai Supply 0 0 5 5 16
Thai Demand 1.51798 0.731882 103.683 95.5512 380.85
Vietnamese Supply 0 0 1 4 30
Vietnamese Demand 2.03833 0.393362 71.5918 85.8959 399.191
Middle Eastern Supply 0 0 2 3 33
Middle Eastern Demand 1.75445 1.35344 63.2104 95.9435 354.199
In [684]:
df_clust_group_info.to_pickle('df_clust_group_info.pkl')
In [922]:
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
df_clust_group_info.transpose()
Out[922]:
label 0.0 1.0 2.0 3.0 5.0 7.0 8.0 9.0 10.0 11.0 ... 2710.0 2712.0 2714.0 2715.0 2721.0 2722.0 2723.0 2726.0 2727.0 2733.0
size 6 4 84 134 916 7 25 5 15 22 ... 4 4 4 4 5 4 4 4 4 4
latitude 37.8184 38.0476 40.2546 38.9435 40.0187 36.7466 40.1091 40.5411 39.4455 38.9699 ... 37.5806 37.5807 37.1848 40.9686 39.9462 44.0913 38.9643 41.299 43.5473 40.541
longitude -90.6483 -96.3244 -91.6113 -93.6417 -92.8236 -103.075 -86.3984 -86.9189 -97.3402 -92.9173 ... -89.1039 -96.7572 -96.8282 -86.6563 -92.1143 -86.97 -105.382 -95.0256 -78.5117 -85.745
city Montréal, Scottsdale, Las Vegas, Concord, Clev... Markham, Scottsdale, Phoenix, Valley City Phoenix, Cleveland, Guadalupe, Las Vegas, Toro... Calgary, Toronto, Chagrin Falls, Avondale, Oak... Monticello, Toronto, Phoenix, Fairlawn, Montré... North York, Toronto, Las Vegas, Chandler, Scot... Charlotte, Las Vegas, Madison, Huntersville, P... Charlotte, Las Vegas, Toronto, Bolton Las Vegas, Phoenix, Scottsdale, Toronto, Glend... Fort Mill, Pointe-Aux-Trembles, Las Vegas, Sto... ... Las Vegas, Thornhill, Charlotte Chandler, Pittsburgh, Mt. Lebanon, Las Vegas Charlotte, Las Vegas, Chandler, Newmarket Toronto, Streetsboro, Verdun, Glendale McKees Rocks, Las Vegas, Newmarket, Montréal, ... Vaughan, Joliette, Calgary, Cornelius Scottsdale, Calgary, Boulder City, Charlotte Mesa, Calgary, Concord, Verdun Montréal, Toronto, Parma Montréal, Outremont, Las Vegas, Monroe
zip H2G 1K7, 85257, 89148, 28027, 44106, 28173 L3R 1M5, 85257, 85053, 44280 85007, 44135, 85283, 89119, M1T 3K8, 15234, 85... T3C 1S2, M6J 1V9, M5R 3G6, 44023, T2E 4X4, 853... 61856, M9W 0B5, 85001, 44333, 85032, H2Z 1B9, ... M3K 1E2, M4S 1Z8, 89146, 85225, 85249, 85266, ... 28208, 89130, 53719, 28078, 85009, 28027, H2X ... 28262, 89107, M6E 1C4, L7E 1C8, M5S 1V8 89149, 85024, 85255, M4K 1P5, 85302, 85254, H2... 29715, H1B 4A4, 89102, 44224, 44113, M5J 1E6, ... ... 89128, L3T 2B2, 28226, 28280 85249, 15237, 15216, 89109 28227, 89117, 85224, L3Y 2R2 M9A 1C2, 44241, H4H 1K9, 85301 15136, 89148, L3Y 0C1, H2C 1S6, 85028 L4H 2P8, J6E 3E2, T2A 0P7, 28031 85260, T3J 3C7, 89005, 28203 85202, T2X 0M5, 28025, H4H 1N4 H2T 1R9, M5C 2C5, M5T 3K5, 44134 H2V 4G9, H2V 1L2, 89103, 28110
neighborhood Rosemont-La Petite-Patrie, , Spring Valley Unionville, , Riverside, Southeast, Scarborough, Etobicoke... , Trinity Bellwoods, Seaton Village, Ville-Mar... , Etobicoke, Ville-Marie, Scarborough, Mercier... Downsview, Mount Pleasant and Davisville, West... , Northwest, Ville-Marie, Milliken, Riverside,... Derita, , Corso Italia, The Annex Centennial, , Greektown, Ville-Marie, Mississa... , Rivière-des-Prairies–Pointe-aux-Trembles, We... ... Summerlin, Langstaff, , Uptown , The Strip Eastland, Westside, Etobicoke, , Verdun , Spring Valley, Ahuntsic-Cartierville , Dilworth , Verdun Plateau-Mont-Royal, Downtown Core, Plateau-Mont-Royal, Outremont,
reviews_count 278 97 5983 8916 60336 204 2102 146 745 2166 ... 109 230 40 605 200 37 393 44 204 55
user_count 278 97 5896 8594 50711 204 2090 146 737 2151 ... 109 230 40 605 200 37 393 44 204 55
total_stars 21 11.5 284.5 450.5 3155.5 23 90.5 17 47.5 83 ... 15 12 12.5 16 17.5 14 11.5 15 14 13.5
total_open 4 2 64 98 678 4 20 5 10 16 ... 3 4 2 4 3 3 3 3 3 3
Sandwiches Supply 3 0 10 24 129 2 5 1 0 0 ... 1 2 1 0 1 1 1 1 0 1
Sandwiches Demand 15.0456 6.11742 367.459 543.169 1956.29 21.535 165.708 9.09346 50.7581 129.251 ... 10.0441 16.0789 5.58008 54.8501 19.5923 1.77736 24.139 3.59605 13.6402 3.01738
American (Traditional) Supply 2 2 9 23 134 0 3 0 1 2 ... 0 0 0 0 0 0 2 0 0 1
American (Traditional) Demand 17.1568 8.62011 328.371 454.502 1859 14.7376 114.953 4.58813 36.1767 118.123 ... 8.31424 15.9333 3.69831 43.2952 13.8756 1.47376 25.9994 1.89086 8.70353 1.86306
Pizza Supply 1 0 12 14 131 1 6 1 3 3 ... 0 0 0 1 1 0 0 2 0 0
Pizza Demand 7.34724 6.03199 326.045 471.629 1842.03 16.0097 141.774 10.7941 56.737 117.647 ... 7.80077 16.3725 4.71675 71.3861 15.6015 1.63272 20.3183 4.53533 10.9301 1.67807
Burgers Supply 0 1 11 16 97 0 1 0 1 2 ... 0 0 1 0 1 1 1 0 1 1
Burgers Demand 6.77035 6.77035 301.815 343.791 1384.64 13.22 93.0745 5.55881 35.4909 85.2707 ... 5.87951 11.082 4.38291 33.4598 16.8546 1.46097 19.0639 2.77941 9.83482 2.06674
Italian Supply 0 0 15 12 100 1 1 3 3 5 ... 0 0 1 1 0 0 0 0 0 0
Italian Demand 5.58491 3.53928 239.664 335.907 1260.7 13.5077 90.6575 10.423 46.4652 114.101 ... 6.00703 9.15666 2.98728 46.4652 12.0465 1.39623 12.8908 1.55858 10.488 1.39623
Mexican Supply 0 0 9 20 82 1 2 1 2 1 ... 0 1 1 0 0 1 0 0 0 0
Mexican Demand 4.89458 3.50055 223.509 332.708 1223.61 13.3517 84.6329 5.731 22.2734 77.1361 ... 5.08045 11.3071 4.49186 36.5545 10.2538 1.17718 13.2897 0.898372 5.35926 0.836416
Chinese Supply 0 0 5 11 80 0 2 0 2 3 ... 1 0 0 0 0 0 0 0 0 0
Chinese Demand 4.12099 6.41042 250.831 341.676 1230.85 22.2075 105.864 6.04411 54.9007 79.9013 ... 8.74565 8.60828 2.56417 24.6801 14.6066 1.46524 8.88301 1.41945 26.9696 0.686831
American (New) Supply 2 0 4 17 78 1 1 0 1 5 ... 0 1 0 0 0 0 1 0 0 1
American (New) Demand 10.1805 3.2736 209.006 328.511 1157.14 8.93943 67.5404 2.80594 18.5264 95.7617 ... 5.28812 10.3064 1.97855 25.829 6.15148 0.485643 18.6343 0.467657 4.65858 0.737458
Japanese Supply 0 1 5 3 56 0 0 0 2 1 ... 0 0 0 1 0 0 0 0 0 1
Japanese Demand 2.82212 2.27205 139.217 158.972 783.498 10.0927 50.8222 3.56353 25.2795 48.2392 ... 3.5157 2.51121 1.10015 9.71002 8.68162 0.64574 4.42452 0.837071 13.2975 1.1719
Chicken Wings Supply 0 0 6 6 50 0 0 0 0 1 ... 0 0 0 1 1 0 1 0 0 0
Chicken Wings Demand 4.23042 3.45584 136.386 171.302 691.405 6.91167 46.5942 3.03875 15.4321 42.3638 ... 5.06459 5.24334 2.20458 18.0538 8.46084 0.715 6.49459 0.89375 4.17084 0.417084
Seafood Supply 1 0 3 7 38 1 0 0 0 3 ... 2 0 0 0 1 0 1 0 1 0
Seafood Demand 6.21818 1.39757 116.626 177.026 659.674 5.8941 43.3449 2.47107 15.4948 50.4138 ... 4.51679 4.1522 1.23553 13.5098 6.7448 0.445602 11.1198 0.789931 7.83855 0.648149
Sushi Bars Supply 1 0 2 0 40 0 2 0 1 3 ... 0 0 0 1 0 0 0 0 0 1
Sushi Bars Demand 4.89292 1.43095 114.43 132.086 596.798 6.9701 37.02 2.60802 16.779 46.1596 ... 3.39273 3.55429 1.17707 9.1396 5.60839 0.877033 8.10101 0.715474 6.04691 0.830873
Canadian (New) Supply 0 0 5 1 48 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Canadian (New) Demand 0.460453 1.26625 121.214 119.948 645.268 11.684 75.687 8.34572 58.938 9.26662 ... 2.82028 0 0.17267 1.43892 11.5689 2.35982 0.17267 3.97141 19.2239 3.56851
Asian Fusion Supply 0 0 4 6 36 1 1 0 0 1 ... 0 0 0 0 0 0 1 0 0 0
Asian Fusion Demand 2.62619 1.38435 106.269 138.272 535.947 7.00318 39.3114 2.19867 11.8484 37.6624 ... 2.82977 2.03581 0.834681 8.32645 5.9242 0.366445 10.8101 0.325729 5.65954 0.488594
Mediterranean Supply 0 0 5 2 46 0 1 0 1 0 ... 0 0 0 1 1 0 0 0 0 1
Mediterranean Demand 2.28902 1.11085 91.3587 136.129 477.26 5.89085 52.7484 2.6593 18.6488 28.5791 ... 1.48113 3.93846 0.976199 10.5026 7.60762 0.134648 4.67902 0.538592 4.17409 1.34648
Steakhouses Supply 1 0 3 7 28 0 1 0 1 0 ... 1 0 0 1 0 0 1 0 0 1
Steakhouses Demand 5.00484 1.05985 69.7144 89.5375 373.812 2.62999 27.6934 1.23649 8.47878 26.771 ... 2.15895 2.06081 0.412163 10.9125 2.84589 0.235522 6.94789 0.667311 1.72716 0.529924
Indian Supply 0 0 5 3 31 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Indian Demand 1.70782 0.904139 71.6279 86.3453 409.977 4.67138 34.7089 2.81288 18.4344 17.8316 ... 1.85851 1.20552 0.35161 5.57552 3.86771 0.20092 2.41104 0.552529 5.02299 0.20092
Thai Supply 0 0 5 5 16 1 2 0 0 1 ... 0 0 0 0 0 0 0 1 1 0
Thai Demand 1.51798 0.731882 103.683 95.5512 380.85 5.90927 30.4951 2.19564 12.4691 23.9352 ... 2.11432 2.54803 0.487921 5.74663 3.6323 0.487921 4.22865 0.487921 4.4455 0.271067
Vietnamese Supply 0 0 1 4 30 1 0 0 0 2 ... 0 0 0 0 0 1 0 0 0 0
Vietnamese Demand 2.03833 0.393362 71.5918 85.8959 399.191 6.65139 29.5021 1.50193 14.1968 28.1075 ... 1.46617 1.43041 0.572163 5.90043 5.18522 1.0728 2.21713 0.607923 6.72291 0.250321
Middle Eastern Supply 0 0 2 3 33 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
Middle Eastern Demand 1.75445 1.35344 63.2104 95.9435 354.199 3.96005 30.0763 3.3084 16.5921 16.4919 ... 1.05267 2.90738 0.451145 4.66183 7.51908 0.150382 1.25318 0.651654 7.16819 0.501272

50 rows × 2086 columns

For denser cluster population, interval with 20 limits is used to indicate size of each cluster and same number of colors to easily recognize each cluster on the map.



Classification Model Evaluation

We will try the following classification models to evaluate and find the best algorithm for our classification models = ['LR','LDA','KNN','CART','GNB','MNB','BNB','LSVM','SVM','RF','BAG'] from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.naive_bayes import GaussianNB from sklearn.naive_bayes import MultinomialNB from sklearn.naive_bayes import BernoulliNB from sklearn.svm import LinearSVC from sklearn.svm import SVC from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import BaggingClassifier

1. Linear Regression - LR
2. Linear Discriminant Analysis - LDA
3. K Nearest Neighbor - KNN
4. Decision Tree Classifier - CART
5. Gaussian Naiive Bayes - GNB
6. Multinomial Naiive Bayes - MNB
7. Bernoulli Naiive Bayes - BNB
8. Linear Support Vector Machine - LSVM
9. Multilinear Support Vector Machine - SVM
10. Random Forest - RF
11. Bagging (Boosting Aggregations) - BAG

In [923]:
top_20_specific_categories = pd.read_pickle('df_top_20_specific_categories.pkl')['categories'].values
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
df_restaurants_label_filtered = pd.read_pickle('df_restaurants_label_filtered.pkl').reset_index()

# Adding additonal columns to the data from groups, clusters ratio and 
# expected demand/supply will be based on this ratio
demand_cats = [x + ' Demand' for x in top_20_specific_categories]
supply_cats = [x + ' Supply' for x in top_20_specific_categories]
local_demand_cats = [x + ' Local Demand' for x in top_20_specific_categories]
display_cats = [x + ' Display' for x in top_20_specific_categories]


# Add new Local Demand columns for each category
df_clust_group_info[local_demand_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
df_clust_group_info[display_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])

scaler = MinMaxScaler(feature_range=(0, 1))

t = None
for index,row in df_clust_group_info.iterrows():
    cluster_supply = row[supply_cats].transpose().sum()
    cluster_demand = row[demand_cats].transpose().sum()
    cluster_adjustment_ratio = (cluster_supply / cluster_demand) if cluster_demand > 0 else 0
    for x in top_20_specific_categories:
        localDemand = round(row[x + ' Demand'] * cluster_adjustment_ratio)
        df_clust_group_info.at[index, x + ' Local Demand'] = localDemand
    # apply (n - min)/(max - min) formula to the difference of Local Demand and Supply to normalize display
    diff = row[supply_cats].values - df_clust_group_info.loc[index, local_demand_cats].values
    scaled = scaler.fit_transform(diff.astype('float64').reshape(-1,1))
    for i in range(len(diff)):
        df_clust_group_info.at[index, top_20_specific_categories[i] + ' Display'] = scaled[i]
            

df_clust_group_info[display_cats].head() 
        
Out[923]:
Sandwiches Display American (Traditional) Display Pizza Display Burgers Display Italian Display Mexican Display Chinese Display American (New) Display Japanese Display Chicken Wings Display Seafood Display Sushi Bars Display Canadian (New) Display Asian Fusion Display Mediterranean Display Steakhouses Display Indian Display Thai Display Vietnamese Display Middle Eastern Display
label
0.0 1.000000 0.500000 0.500000 0.000000 0.000000 0.000000 0.500000 1.000000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000
1.0 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2.0 0.090909 0.090909 0.454545 0.363636 1.000000 0.454545 0.000000 0.090909 0.363636 0.454545 0.272727 0.181818 0.454545 0.363636 0.545455 0.454545 0.545455 0.454545 0.181818 0.363636
3.0 0.583333 0.833333 0.000000 0.583333 0.333333 1.000000 0.166667 0.750000 0.166667 0.333333 0.416667 0.000000 0.083333 0.500000 0.166667 0.666667 0.416667 0.500000 0.500000 0.333333
5.0 0.086957 0.608696 0.521739 0.478261 0.956522 0.304348 0.173913 0.347826 0.521739 0.521739 0.130435 0.391304 0.608696 0.391304 1.000000 0.565217 0.565217 0.000000 0.565217 0.826087

1. Model Evaluation - All Categories in Individual Binary Columns

The data below is the merged set of clustered information with individual restaurants informaiton. It contains categories of restaurants as binary columns (20 columns, 1 column per category) for classification. The classification of each category from this data seaprately will be compared with all category as one column from step 2 below.

In [820]:
df_clust_group_info_prefixed = df_clust_group_info.add_prefix('Group ')

merged_restaurant_and_groups = pd.merge(df_clust_group_info_prefixed,df_restaurants_label_filtered, \
                                         left_on = 'label', right_on='label')


display(merged_restaurant_and_groups.head().transpose())
0 1 2 3 4
label 0 0 0 0 0
Group size 6 6 6 6 6
Group latitude 37.8184 37.8184 37.8184 37.8184 37.8184
Group longitude -90.6483 -90.6483 -90.6483 -90.6483 -90.6483
Group city Montréal, Scottsdale, Las Vegas, Concord, Clev... Montréal, Scottsdale, Las Vegas, Concord, Clev... Montréal, Scottsdale, Las Vegas, Concord, Clev... Montréal, Scottsdale, Las Vegas, Concord, Clev... Montréal, Scottsdale, Las Vegas, Concord, Clev...
Group zip H2G 1K7, 85257, 89148, 28027, 44106, 28173 H2G 1K7, 85257, 89148, 28027, 44106, 28173 H2G 1K7, 85257, 89148, 28027, 44106, 28173 H2G 1K7, 85257, 89148, 28027, 44106, 28173 H2G 1K7, 85257, 89148, 28027, 44106, 28173
Group neighborhood Rosemont-La Petite-Patrie, , Spring Valley Rosemont-La Petite-Patrie, , Spring Valley Rosemont-La Petite-Patrie, , Spring Valley Rosemont-La Petite-Patrie, , Spring Valley Rosemont-La Petite-Patrie, , Spring Valley
Group reviews_count 278 278 278 278 278
Group user_count 278 278 278 278 278
Group total_stars 21 21 21 21 21
Group total_open 4 4 4 4 4
Group Sandwiches Supply 3 3 3 3 3
Group Sandwiches Demand 15.0456 15.0456 15.0456 15.0456 15.0456
Group American (Traditional) Supply 2 2 2 2 2
Group American (Traditional) Demand 17.1568 17.1568 17.1568 17.1568 17.1568
Group Pizza Supply 1 1 1 1 1
Group Pizza Demand 7.34724 7.34724 7.34724 7.34724 7.34724
Group Burgers Supply 0 0 0 0 0
Group Burgers Demand 6.77035 6.77035 6.77035 6.77035 6.77035
Group Italian Supply 0 0 0 0 0
Group Italian Demand 5.58491 5.58491 5.58491 5.58491 5.58491
Group Mexican Supply 0 0 0 0 0
Group Mexican Demand 4.89458 4.89458 4.89458 4.89458 4.89458
Group Chinese Supply 0 0 0 0 0
Group Chinese Demand 4.12099 4.12099 4.12099 4.12099 4.12099
Group American (New) Supply 2 2 2 2 2
Group American (New) Demand 10.1805 10.1805 10.1805 10.1805 10.1805
Group Japanese Supply 0 0 0 0 0
Group Japanese Demand 2.82212 2.82212 2.82212 2.82212 2.82212
Group Chicken Wings Supply 0 0 0 0 0
... ... ... ... ... ...
latitude 45.5405 33.4788 36.0612 35.3655 41.5099
longitude -73.5993 -111.926 -115.29 -80.712 -81.6031
is_open 0 1 1 1 0
city Montréal Scottsdale Las Vegas Concord Cleveland
neighborhood Rosemont-La Petite-Patrie Spring Valley
state QC AZ NV NC OH
postal_code H2G 1K7 85257 89148 28027 44106
stars 4 3 5 3 3.5
categories [Sandwiches] [Sandwiches] [American (Traditional), American (New)] [American (Traditional), Seafood, Steakhouses] [Sushi Bars, Sandwiches, American (New)]
Sandwiches 1 1 0 0 1
American (Traditional) 0 0 1 1 0
Pizza 0 0 0 0 0
Burgers 0 0 0 0 0
Italian 0 0 0 0 0
Mexican 0 0 0 0 0
Chinese 0 0 0 0 0
American (New) 0 0 1 0 1
Japanese 0 0 0 0 0
Chicken Wings 0 0 0 0 0
Seafood 0 0 0 1 0
Sushi Bars 0 0 0 0 1
Canadian (New) 0 0 0 0 0
Asian Fusion 0 0 0 0 0
Mediterranean 0 0 0 0 0
Steakhouses 0 0 0 1 0
Indian 0 0 0 0 0
Thai 0 0 0 0 0
Vietnamese 0 0 0 0 0
Middle Eastern 0 0 0 0 0
count 6 6 6 6 6

83 rows × 5 columns

In [821]:
merged_restaurant_and_groups.to_pickle('merged_restaurant_and_groups.pkl')
In [822]:
# Delete columns that cannot added to new restaurant businesses or that will not be used full in a group

merged_restaurant_and_groups = pd.read_pickle('merged_restaurant_and_groups.pkl')
df_data_clean = merged_restaurant_and_groups.drop(['business_id','index', 'categories', \
                                                                  'latitude', 'longitude', 'Group city','Group zip', 'Group neighborhood'], axis=1)

df_data_clean.to_pickle('df_data_clean.pkl')

print(df_data_clean.shape)
df_data_clean.head().transpose()
(19388, 75)
Out[822]:
0 1 2 3 4
label 0 0 0 0 0
Group size 6 6 6 6 6
Group latitude 37.8184 37.8184 37.8184 37.8184 37.8184
Group longitude -90.6483 -90.6483 -90.6483 -90.6483 -90.6483
Group reviews_count 278 278 278 278 278
Group user_count 278 278 278 278 278
Group total_stars 21 21 21 21 21
Group total_open 4 4 4 4 4
Group Sandwiches Supply 3 3 3 3 3
Group Sandwiches Demand 15.0456 15.0456 15.0456 15.0456 15.0456
Group American (Traditional) Supply 2 2 2 2 2
Group American (Traditional) Demand 17.1568 17.1568 17.1568 17.1568 17.1568
Group Pizza Supply 1 1 1 1 1
Group Pizza Demand 7.34724 7.34724 7.34724 7.34724 7.34724
Group Burgers Supply 0 0 0 0 0
Group Burgers Demand 6.77035 6.77035 6.77035 6.77035 6.77035
Group Italian Supply 0 0 0 0 0
Group Italian Demand 5.58491 5.58491 5.58491 5.58491 5.58491
Group Mexican Supply 0 0 0 0 0
Group Mexican Demand 4.89458 4.89458 4.89458 4.89458 4.89458
Group Chinese Supply 0 0 0 0 0
Group Chinese Demand 4.12099 4.12099 4.12099 4.12099 4.12099
Group American (New) Supply 2 2 2 2 2
Group American (New) Demand 10.1805 10.1805 10.1805 10.1805 10.1805
Group Japanese Supply 0 0 0 0 0
Group Japanese Demand 2.82212 2.82212 2.82212 2.82212 2.82212
Group Chicken Wings Supply 0 0 0 0 0
Group Chicken Wings Demand 4.23042 4.23042 4.23042 4.23042 4.23042
Group Seafood Supply 1 1 1 1 1
Group Seafood Demand 6.21818 6.21818 6.21818 6.21818 6.21818
... ... ... ... ... ...
Group Vietnamese Demand 2.03833 2.03833 2.03833 2.03833 2.03833
Group Middle Eastern Supply 0 0 0 0 0
Group Middle Eastern Demand 1.75445 1.75445 1.75445 1.75445 1.75445
is_open 0 1 1 1 0
city Montréal Scottsdale Las Vegas Concord Cleveland
neighborhood Rosemont-La Petite-Patrie Spring Valley
state QC AZ NV NC OH
postal_code H2G 1K7 85257 89148 28027 44106
stars 4 3 5 3 3.5
Sandwiches 1 1 0 0 1
American (Traditional) 0 0 1 1 0
Pizza 0 0 0 0 0
Burgers 0 0 0 0 0
Italian 0 0 0 0 0
Mexican 0 0 0 0 0
Chinese 0 0 0 0 0
American (New) 0 0 1 0 1
Japanese 0 0 0 0 0
Chicken Wings 0 0 0 0 0
Seafood 0 0 0 1 0
Sushi Bars 0 0 0 0 1
Canadian (New) 0 0 0 0 0
Asian Fusion 0 0 0 0 0
Mediterranean 0 0 0 0 0
Steakhouses 0 0 0 1 0
Indian 0 0 0 0 0
Thai 0 0 0 0 0
Vietnamese 0 0 0 0 0
Middle Eastern 0 0 0 0 0
count 6 6 6 6 6

75 rows × 5 columns

In [629]:
import warnings; warnings.simplefilter('ignore')

df_data_clean = pd.read_pickle('df_data_clean.pkl')

# Eliminate categories indicators from the dataset since that's what we are trying to predict
X = df_data_clean[df_data_clean.columns.difference(top_20_specific_categories)]


# Transform strings into equivalent label values
for column in X.columns:
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])
        
df_cross_val_results = pd.DataFrame(None, columns=['category', 'score'])   


__bar14 = __progressbar(20 * 11)
for cat in top_20_specific_categories:
        
    # Iterate through each category to predict it
    y = df_data_clean[cat]

    # Use a MinMaxScaler to scale values between 0 and 1
    # It is needed by some algorithms such as MultinomialNB

    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    print('--{}--'.format(cat))
    result = run_classifiers(X = X_scaled, y = y, num_splits = 10, rnd_state = 1, __bar = __bar14)
    __bar14.value += 11
    df_cross_val_results = df_cross_val_results.append({'category': cat, 'score': result}, ignore_index=True)
    

    
0 1 2 3 4
Group latitude 37.8184 37.8184 37.8184 37.8184 37.8184
Group longitude -90.6483 -90.6483 -90.6483 -90.6483 -90.6483
Group reviews_count 278 278 278 278 278
Group size 6 6 6 6 6
Group total_open 4 4 4 4 4
Group total_stars 21 21 21 21 21
Group user_count 278 278 278 278 278
city Montréal Scottsdale Las Vegas Concord Cleveland
count 6 6 6 6 6
is_open 0 1 1 1 0
label 0 0 0 0 0
neighborhood Rosemont-La Petite-Patrie Spring Valley
postal_code H2G 1K7 85257 89148 28027 44106
stars 4 3 5 3 3.5
state QC AZ NV NC OH
--Sandwiches--
LR: 0.847895 (0.007896)
LDA: 0.813960 (0.098990)
KNN: 0.827262 (0.013547)
CART: 0.719514 (0.031433)
GNB: 0.801582 (0.134187)
MNB: 0.847895 (0.007896)
BNB: 0.847379 (0.008042)
LSVM: 0.847895 (0.007896)
SVM: 0.847895 (0.007896)
RF: 0.838301 (0.009737)
BAG: 0.827933 (0.015352)
--American (Traditional)--
LR: 0.853413 (0.009495)
LDA: 0.853413 (0.009495)
KNN: 0.832577 (0.008846)
CART: 0.738960 (0.026213)
GNB: 0.852072 (0.010560)
MNB: 0.853413 (0.009495)
BNB: 0.853104 (0.009795)
LSVM: 0.853413 (0.009495)
SVM: 0.853413 (0.009495)
RF: 0.843458 (0.011992)
BAG: 0.837166 (0.013054)
--Pizza--
LR: 0.856716 (0.006166)
LDA: 0.855375 (0.007655)
KNN: 0.838199 (0.005788)
CART: 0.726892 (0.028601)
GNB: 0.809681 (0.141103)
MNB: 0.856716 (0.006166)
BNB: 0.856509 (0.006256)
LSVM: 0.856716 (0.006166)
SVM: 0.856716 (0.006166)
RF: 0.848876 (0.007648)
BAG: 0.843047 (0.010236)
--Burgers--
LR: 0.883896 (0.006393)
LDA: 0.883948 (0.006457)
KNN: 0.874612 (0.008482)
CART: 0.777077 (0.042547)
GNB: 0.841812 (0.108876)
MNB: 0.883896 (0.006393)
BNB: 0.883690 (0.006362)
LSVM: 0.883896 (0.006393)
SVM: 0.883896 (0.006393)
RF: 0.879409 (0.007172)
BAG: 0.869298 (0.015425)
--Italian--
LR: 0.900661 (0.006211)
LDA: 0.863735 (0.112721)
KNN: 0.892718 (0.006428)
CART: 0.799001 (0.028914)
GNB: 0.841197 (0.180274)
MNB: 0.900661 (0.006211)
BNB: 0.899887 (0.005522)
LSVM: 0.900661 (0.006211)
SVM: 0.900661 (0.006211)
RF: 0.898392 (0.006433)
BAG: 0.892306 (0.017095)
--Mexican--
LR: 0.902517 (0.007995)
LDA: 0.861981 (0.122040)
KNN: 0.889726 (0.012381)
CART: 0.800183 (0.035007)
GNB: 0.838721 (0.182547)
MNB: 0.902517 (0.007995)
BNB: 0.901795 (0.008374)
LSVM: 0.902517 (0.007995)
SVM: 0.902517 (0.007995)
RF: 0.894522 (0.014411)
BAG: 0.888177 (0.020087)
--Chinese--
LR: 0.908553 (0.007538)
LDA: 0.908553 (0.007538)
KNN: 0.901744 (0.007847)
CART: 0.814318 (0.026507)
GNB: 0.908449 (0.007508)
MNB: 0.908553 (0.007538)
BNB: 0.907985 (0.007452)
LSVM: 0.908553 (0.007538)
SVM: 0.908553 (0.007538)
RF: 0.904220 (0.007172)
BAG: 0.890652 (0.028261)
--American (New)--
LR: 0.909532 (0.006254)
LDA: 0.909532 (0.006254)
KNN: 0.899216 (0.006749)
CART: 0.817722 (0.034989)
GNB: 0.888645 (0.060277)
MNB: 0.909532 (0.006254)
BNB: 0.909067 (0.005936)
LSVM: 0.909532 (0.006254)
SVM: 0.909532 (0.006254)
RF: 0.905250 (0.006682)
BAG: 0.897770 (0.015410)
--Japanese--
LR: 0.944605 (0.003813)
LDA: 0.944605 (0.003813)
KNN: 0.940840 (0.003527)
CART: 0.871774 (0.031869)
GNB: 0.887978 (0.172244)
MNB: 0.944605 (0.003813)
BNB: 0.943831 (0.004171)
LSVM: 0.944605 (0.003813)
SVM: 0.944605 (0.003813)
RF: 0.943470 (0.003636)
BAG: 0.941871 (0.005777)
--Chicken Wings--
LR: 0.944243 (0.004935)
LDA: 0.902160 (0.124946)
KNN: 0.941613 (0.005567)
CART: 0.882296 (0.023363)
GNB: 0.942541 (0.006064)
MNB: 0.944243 (0.004935)
BNB: 0.943676 (0.005436)
LSVM: 0.944243 (0.004935)
SVM: 0.944243 (0.004935)
RF: 0.942851 (0.004795)
BAG: 0.941355 (0.007066)
--Seafood--
LR: 0.949247 (0.004119)
LDA: 0.949247 (0.004119)
KNN: 0.947855 (0.004298)
CART: 0.871251 (0.066652)
GNB: 0.949041 (0.004381)
MNB: 0.949247 (0.004119)
BNB: 0.948473 (0.004765)
LSVM: 0.949247 (0.004119)
SVM: 0.949247 (0.004119)
RF: 0.948164 (0.005171)
BAG: 0.940682 (0.023072)
--Sushi Bars--
LR: 0.953941 (0.004602)
LDA: 0.953941 (0.004602)
KNN: 0.952806 (0.004724)
CART: 0.889310 (0.023994)
GNB: 0.953941 (0.004602)
MNB: 0.953941 (0.004602)
BNB: 0.953631 (0.004536)
LSVM: 0.953941 (0.004602)
SVM: 0.953941 (0.004602)
RF: 0.953528 (0.004724)
BAG: 0.952187 (0.004604)
--Canadian (New)--
LR: 0.959408 (0.004269)
LDA: 0.959408 (0.004269)
KNN: 0.956881 (0.002547)
CART: 0.921756 (0.005679)
GNB: 0.956984 (0.005164)
MNB: 0.959408 (0.004269)
BNB: 0.958892 (0.004332)
LSVM: 0.959408 (0.004269)
SVM: 0.959408 (0.004269)
RF: 0.958635 (0.004192)
BAG: 0.956571 (0.005003)
--Asian Fusion--
LR: 0.958221 (0.005743)
LDA: 0.958221 (0.005743)
KNN: 0.957138 (0.005982)
CART: 0.869854 (0.103497)
GNB: 0.958221 (0.005743)
MNB: 0.958221 (0.005743)
BNB: 0.957654 (0.005955)
LSVM: 0.958221 (0.005743)
SVM: 0.958221 (0.005743)
RF: 0.957189 (0.006373)
BAG: 0.949037 (0.023370)
--Mediterranean--
LR: 0.961729 (0.003518)
LDA: 0.919233 (0.129548)
KNN: 0.961059 (0.003962)
CART: 0.917114 (0.018076)
GNB: 0.898655 (0.191270)
MNB: 0.961729 (0.003518)
BNB: 0.960955 (0.003819)
LSVM: 0.961729 (0.003518)
SVM: 0.961729 (0.003518)
RF: 0.961007 (0.003791)
BAG: 0.960078 (0.005220)
--Steakhouses--
LR: 0.967970 (0.002896)
LDA: 0.967970 (0.002896)
KNN: 0.967299 (0.003071)
CART: 0.926759 (0.012491)
GNB: 0.967763 (0.002573)
MNB: 0.967970 (0.002896)
BNB: 0.967196 (0.003732)
LSVM: 0.967970 (0.002896)
SVM: 0.967970 (0.002896)
RF: 0.967454 (0.002831)
BAG: 0.967196 (0.003029)
--Indian--
LR: 0.969001 (0.003396)
LDA: 0.922637 (0.139801)
KNN: 0.965443 (0.007438)
CART: 0.929905 (0.021262)
GNB: 0.898088 (0.212064)
MNB: 0.969001 (0.003396)
BNB: 0.968485 (0.003580)
LSVM: 0.969001 (0.003396)
SVM: 0.969001 (0.003396)
RF: 0.968228 (0.003643)
BAG: 0.967247 (0.003503)
--Thai--
LR: 0.967505 (0.004921)
LDA: 0.967505 (0.004921)
KNN: 0.966835 (0.004960)
CART: 0.924179 (0.014814)
GNB: 0.961160 (0.009529)
MNB: 0.967505 (0.004921)
BNB: 0.966577 (0.005814)
LSVM: 0.967505 (0.004921)
SVM: 0.967505 (0.004921)
RF: 0.967144 (0.005018)
BAG: 0.966319 (0.005092)
--Vietnamese--
LR: 0.972406 (0.003984)
LDA: 0.972406 (0.003984)
KNN: 0.971580 (0.003370)
CART: 0.934185 (0.021843)
GNB: 0.972044 (0.004327)
MNB: 0.972406 (0.003984)
BNB: 0.971838 (0.004051)
LSVM: 0.972406 (0.003984)
SVM: 0.972406 (0.003984)
RF: 0.971683 (0.003926)
BAG: 0.966162 (0.015685)
--Middle Eastern--
LR: 0.973541 (0.003126)
LDA: 0.973541 (0.003126)
KNN: 0.973025 (0.003066)
CART: 0.937229 (0.009008)
GNB: 0.973541 (0.003126)
MNB: 0.973541 (0.003126)
BNB: 0.972870 (0.002815)
LSVM: 0.973541 (0.003126)
SVM: 0.973541 (0.003126)
RF: 0.973438 (0.003220)
BAG: 0.972561 (0.003849)
In [630]:
df_cross_val_results.to_pickle('df_cross_val_results.pkl')
In [979]:
df_cross_val_results = pd.read_pickle('df_cross_val_results.pkl')

models = ['LR','LDA','KNN','CART','GNB','MNB','BNB','LSVM','SVM','RF','BAG']

df_model_aggr = pd.DataFrame(columns=top_20_specific_categories)

for i, row in df_cross_val_results.iterrows():
    for m in range(0,len(models)):
        df_model_aggr.at[models[m], row['category']] = row['score'][m].mean()

df_model_aggr.head()
Out[979]:
Sandwiches American (Traditional) Pizza Burgers Italian Mexican Chinese American (New) Japanese Chicken Wings Seafood Sushi Bars Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern
LR 0.847895 0.853413 0.856716 0.883896 0.900661 0.902517 0.908553 0.909532 0.944605 0.944243 0.949247 0.953941 0.959408 0.958221 0.961729 0.96797 0.969001 0.967505 0.972406 0.973541
LDA 0.81396 0.853413 0.855375 0.883948 0.863735 0.861981 0.908553 0.909532 0.944605 0.90216 0.949247 0.953941 0.959408 0.958221 0.919233 0.96797 0.922637 0.967505 0.972406 0.973541
KNN 0.827262 0.832577 0.838199 0.874612 0.892718 0.889726 0.901744 0.899216 0.94084 0.941613 0.947855 0.952806 0.956881 0.957138 0.961059 0.967299 0.965443 0.966835 0.97158 0.973025
CART 0.719514 0.73896 0.726892 0.777077 0.799001 0.800183 0.814318 0.817722 0.871774 0.882296 0.871251 0.88931 0.921756 0.869854 0.917114 0.926759 0.929905 0.924179 0.934185 0.937229
GNB 0.801582 0.852072 0.809681 0.841812 0.841197 0.838721 0.908449 0.888645 0.887978 0.942541 0.949041 0.953941 0.956984 0.958221 0.898655 0.967763 0.898088 0.96116 0.972044 0.973541
In [981]:
df_model_aggr.transpose().mean().sort_values(ascending=False)
df_model_aggr.to_pickle('df_model_aggr.pkl')
In [982]:
df_model_aggr = pd.read_pickle('df_model_aggr.pkl')
df_draw = df_model_aggr
iplot([{
    'x': df_draw.index,
    'y': df_draw[col],
    'name': col
}  for col in df_draw.columns], filename='cufflinks/simple-line2')

df_draw = df_model_aggr.transpose()
iplot([{
    'x': df_draw.index,
    'y': df_draw[col],
    'name': col
}  for col in df_draw.columns], filename='cufflinks/simple-line')

Recursive Feature Elimination with Cross Validation: We pick RandomForest Classifier for recursive feature elimination, however, it is clear that over 85% of features somewhat contribute to the results, that's why, we will leave the features alone for individual category classification as a binary.

In [706]:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_

nb=RandomForestClassifierWithCoef()

rfecv = RFECV(estimator=nb, step=1, cv=StratifiedKFold(10),
              scoring='accuracy')
rfecv.fit(X, y)
print(type(rfecv.grid_scores_))
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
<class 'numpy.ndarray'>

The above images shows that Support Vector Machine, Linear Support Vector Machine, Multinomial Naiive Bayes and Logistic Regression have performed the best. However, the confusion matrices for top 3 models below shows that this performance of the algorithms for binary category prediction (one category column at a time) is misleading because even if both True and False values are marked as True, cross validation accuracy score of 80-98% is acheived, however, it doesn't help us categorize restaurants. Therefore, we will have to rely on merged category classification (all categories in a single column)

In [826]:
import warnings; warnings.simplefilter('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

from sklearn.metrics import classification_report,confusion_matrix

# Eliminate all category information and binary category columns to prev X data
X = df_data_clean[df_data_clean.columns.difference(np.append(top_20_specific_categories,'category'))]

# Transform strings into equivalent label values
for column in X.columns:
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])


models = []
models.append(('LR', LogisticRegression()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# we will only show confusion matrix for 2 categories because it is for demo only as it did not work anyway
for name, model in models:
    for cat in np.take(top_20_specific_categories,[1,6]):

        y = df_data_clean[cat]


        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=120, test_size = 0.3)

        start_time = time.time()

        scaler = MinMaxScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        model.fit(X_train_scaled, y_train)

        print('Accuracy of MNB classifier on training set: {:.2f}'.format(model.score(X_train_scaled, y_train)))
        print('Accuracy of MNB classifier on test set: {:.2f}'.format(model.score(X_test_scaled, y_test)))


        y_pred = model.predict(X_test_scaled)

        print(classification_report(y_test,y_pred))
        display(confusion_matrix(y_test, y_pred))
        
        draw_confusion_matrix(cat, y_test, y_pred)
    print('{} - {}'.format(name, total_y_pred))
Accuracy of MNB classifier on training set: 0.85
Accuracy of MNB classifier on test set: 0.85
             precision    recall  f1-score   support

        0.0       0.85      1.00      0.92      4956
        1.0       0.00      0.00      0.00       861

avg / total       0.73      0.85      0.78      5817

array([[4956,    0],
       [ 861,    0]])
Accuracy of MNB classifier on training set: 0.91
Accuracy of MNB classifier on test set: 0.92
             precision    recall  f1-score   support

        0.0       0.92      1.00      0.96      5329
        1.0       0.00      0.00      0.00       488

avg / total       0.84      0.92      0.88      5817

array([[5329,    0],
       [ 488,    0]])
LR - 0
Accuracy of MNB classifier on training set: 0.85
Accuracy of MNB classifier on test set: 0.85
             precision    recall  f1-score   support

        0.0       0.85      1.00      0.92      4956
        1.0       0.00      0.00      0.00       861

avg / total       0.73      0.85      0.78      5817

array([[4956,    0],
       [ 861,    0]])
Accuracy of MNB classifier on training set: 0.91
Accuracy of MNB classifier on test set: 0.92
             precision    recall  f1-score   support

        0.0       0.92      1.00      0.96      5329
        1.0       0.00      0.00      0.00       488

avg / total       0.84      0.92      0.88      5817

array([[5329,    0],
       [ 488,    0]])
MNB - 0
Accuracy of MNB classifier on training set: 0.85
Accuracy of MNB classifier on test set: 0.85
             precision    recall  f1-score   support

        0.0       0.85      1.00      0.92      4956
        1.0       0.25      0.00      0.00       861

avg / total       0.76      0.85      0.78      5817

array([[4953,    3],
       [ 860,    1]])
Accuracy of MNB classifier on training set: 0.91
Accuracy of MNB classifier on test set: 0.92
             precision    recall  f1-score   support

        0.0       0.92      1.00      0.96      5329
        1.0       0.00      0.00      0.00       488

avg / total       0.84      0.92      0.88      5817

array([[5325,    4],
       [ 488,    0]])
BNB - 0


2. Model Evaluation - All Categories in one Column

In merged_restaurant_and_groups dataframe, categories column contains arrays of categories since a single restaurant may have more than 1 categories. In this section, we flatten categories copying restaurant rows so that they can be used for classification for comparision with binary columns classification of categories.

In [827]:
# separate categories into individual rows for classification
merged_restaurant_and_groups = pd.read_pickle('merged_restaurant_and_groups.pkl')

ids = []
catl = []
for i,row in merged_restaurant_and_groups[['business_id', 'categories']].iterrows():
    for n in row['categories']:
        ids.append(row['business_id'])
        catl.append(n)
df_flat = pd.DataFrame({'business_id': ids, 'category': catl})
merged_restaurant_and_groups_flat = pd.merge(df_flat,merged_restaurant_and_groups, \
                                         left_on = 'business_id', right_on='business_id')

merged_restaurant_and_groups_flat.to_pickle('merged_restaurant_and_groups_flat.pkl')
In [828]:
# Delete columns that cannot added to new restaurant businesses or that will not be used full in a group

merged_restaurant_and_groups_flat = pd.read_pickle('merged_restaurant_and_groups_flat.pkl')
df_data_clean2 = merged_restaurant_and_groups_flat.drop(['business_id','index', 'categories', \
                                                                  'latitude', 'longitude', 'Group city','Group zip', 'Group neighborhood'], axis=1)

df_data_clean2.to_pickle('df_data_clean2.pkl')

print(df_data_clean2.shape)
df_data_clean2.head().transpose()
(27434, 76)
Out[828]:
0 1 2 3 4
category Sandwiches Sandwiches American (Traditional) American (New) American (Traditional)
label 0 0 0 0 0
Group size 6 6 6 6 6
Group latitude 37.8184 37.8184 37.8184 37.8184 37.8184
Group longitude -90.6483 -90.6483 -90.6483 -90.6483 -90.6483
Group reviews_count 278 278 278 278 278
Group user_count 278 278 278 278 278
Group total_stars 21 21 21 21 21
Group total_open 4 4 4 4 4
Group Sandwiches Supply 3 3 3 3 3
Group Sandwiches Demand 15.0456 15.0456 15.0456 15.0456 15.0456
Group American (Traditional) Supply 2 2 2 2 2
Group American (Traditional) Demand 17.1568 17.1568 17.1568 17.1568 17.1568
Group Pizza Supply 1 1 1 1 1
Group Pizza Demand 7.34724 7.34724 7.34724 7.34724 7.34724
Group Burgers Supply 0 0 0 0 0
Group Burgers Demand 6.77035 6.77035 6.77035 6.77035 6.77035
Group Italian Supply 0 0 0 0 0
Group Italian Demand 5.58491 5.58491 5.58491 5.58491 5.58491
Group Mexican Supply 0 0 0 0 0
Group Mexican Demand 4.89458 4.89458 4.89458 4.89458 4.89458
Group Chinese Supply 0 0 0 0 0
Group Chinese Demand 4.12099 4.12099 4.12099 4.12099 4.12099
Group American (New) Supply 2 2 2 2 2
Group American (New) Demand 10.1805 10.1805 10.1805 10.1805 10.1805
Group Japanese Supply 0 0 0 0 0
Group Japanese Demand 2.82212 2.82212 2.82212 2.82212 2.82212
Group Chicken Wings Supply 0 0 0 0 0
Group Chicken Wings Demand 4.23042 4.23042 4.23042 4.23042 4.23042
Group Seafood Supply 1 1 1 1 1
... ... ... ... ... ...
Group Vietnamese Demand 2.03833 2.03833 2.03833 2.03833 2.03833
Group Middle Eastern Supply 0 0 0 0 0
Group Middle Eastern Demand 1.75445 1.75445 1.75445 1.75445 1.75445
is_open 0 1 1 1 1
city Montréal Scottsdale Las Vegas Las Vegas Concord
neighborhood Rosemont-La Petite-Patrie Spring Valley Spring Valley
state QC AZ NV NV NC
postal_code H2G 1K7 85257 89148 89148 28027
stars 4 3 5 5 3
Sandwiches 1 1 0 0 0
American (Traditional) 0 0 1 1 1
Pizza 0 0 0 0 0
Burgers 0 0 0 0 0
Italian 0 0 0 0 0
Mexican 0 0 0 0 0
Chinese 0 0 0 0 0
American (New) 0 0 1 1 0
Japanese 0 0 0 0 0
Chicken Wings 0 0 0 0 0
Seafood 0 0 0 0 1
Sushi Bars 0 0 0 0 0
Canadian (New) 0 0 0 0 0
Asian Fusion 0 0 0 0 0
Mediterranean 0 0 0 0 0
Steakhouses 0 0 0 0 1
Indian 0 0 0 0 0
Thai 0 0 0 0 0
Vietnamese 0 0 0 0 0
Middle Eastern 0 0 0 0 0
count 6 6 6 6 6

76 rows × 5 columns

In [799]:
# run cross validation for combined categories
X = df_data_clean2[df_data_clean2.columns.difference(np.append(top_20_specific_categories,'category'))]
y = df_data_clean2['category']

# Transform strings into equivalent label values
for column in X.columns:
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])
        
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_scaled = scaler.fit_transform(X)

result = run_classifiers(X = X_scaled, y = y, num_splits = 10, rnd_state = 1, __bar = __bar14)
LR: 0.706714 (0.007363)
LDA: 0.690424 (0.046093)
KNN: 0.704345 (0.007048)
CART: 0.705074 (0.007564)
GNB: 0.706678 (0.007315)
MNB: 0.679054 (0.071577)
BNB: 0.706496 (0.007314)
LSVM: 0.706678 (0.007375)
SVM: 0.706678 (0.007375)
RF: 0.705439 (0.007137)
BAG: 0.706605 (0.007312)
In [987]:
pd.DataFrame({'results': [result]}, columns=['results']).to_pickle('df_cross_val_resutls.pkl')
In [993]:
cross_val_results = pd.read_pickle('df_cross_val_resutls.pkl')['results'][0]

models = ['LR','LDA','KNN','CART','GNB','MNB','BNB','LSVM','SVM','RF','BAG']
mean_cross_val = []
for x in cross_val_results:
    mean_cross_val.append(np.mean(x))
mean_cross_val

iplot([{
    'x': models,
    'y': mean_cross_val,
    'name': "Cross Validation Mean"
}], filename='cufflinks/simple-line3')

The above graph shows that LinearRegression is slightly better performant for our analysis than any other classifier, so we will move forward with it for further analysis.

Principal Component Analysis

In [835]:
X = df_data_clean2[df_data_clean2.columns.difference(np.append(top_20_specific_categories,'category'))]

# Transform strings into equivalent label values
for column in X.columns:
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])


models = []
models.append(('LR', LogisticRegression()))

y = df_data_clean2['category']
In [837]:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

rfecv = RFECV(estimator=lr, step=1, cv=StratifiedKFold(10),
              scoring='accuracy')
rfecv.fit(X, y)
print(type(rfecv.grid_scores_))
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
<class 'numpy.ndarray'>
In [871]:
# Getting top 30 attributes from the REFCV and running component analysis on them

df_labeled_rankings = pd.DataFrame({'cols': df_data_clean2.columns \
                                    .difference(np.append(top_20_specific_categories,'category')), \
                      'ranks': rfecv.ranking_}).sort_values(['ranks']).reset_index(drop=True)

df_labeled_rankings.head(30)
Out[871]:
cols ranks
0 Group latitude 1
1 stars 1
2 is_open 1
3 Group Vietnamese Supply 1
4 Group Asian Fusion Supply 1
5 Group Middle Eastern Supply 1
6 Group Chicken Wings Supply 1
7 Group Mediterranean Supply 1
8 Group Canadian (New) Supply 1
9 Group Indian Supply 1
10 Group Steakhouses Supply 1
11 Group Thai Supply 1
12 Group American (New) Supply 1
13 Group Sushi Bars Supply 1
14 Group Japanese Supply 1
15 Group Pizza Supply 1
16 Group longitude 1
17 Group American (Traditional) Supply 1
18 Group total_open 1
19 Group size 1
20 Group Seafood Supply 1
21 Group Burgers Supply 2
22 Group Mexican Supply 3
23 Group Italian Supply 4
24 Group Chinese Supply 5
25 Group Sandwiches Supply 6
26 count 7
27 Group total_stars 8
28 state 9
29 Group Indian Demand 10
In [943]:
import warnings; warnings.simplefilter('ignore')
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.preprocessing import MinMaxScaler
df_data_clean2 = pd.read_pickle('df_data_clean2.pkl')
# Eliminate categories information from the dataset
X = df_data_clean2[df_data_clean2.columns.difference(np.append(top_20_specific_categories,'category'))]

# Keep just the top 30 columns ranked for RFECV above
X = df_data_clean2[df_labeled_rankings['cols'].head(30).values]
#Transform strings into equivalent label values
for column in X.columns:
    if X[column].dtype == type(object):
        le = LabelEncoder()
        X[column] = le.fit_transform(X[column])


model = LogisticRegression()
name = 'LR'

y = df_data_clean2['category']


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=120, test_size = 0.3)

start_time = time.time()


scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


model.fit(X_train_scaled, y_train)

print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(model.score(X_train_scaled, y_train)))
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'.format(model.score(X_test_scaled, y_test)))

y_pred = model.predict(X_test_scaled)

print(classification_report(y_test,y_pred))
display(confusion_matrix(y_test, y_pred))

draw_confusion_matrix_all(top_20_specific_categories, y_test, y_pred)
Accuracy of Logistic Regression classifier on training set: 0.72
Accuracy of Logistic Regression classifier on test set: 0.67
                        precision    recall  f1-score   support

        American (New)       0.62      0.62      0.62       529
American (Traditional)       0.70      0.63      0.66       904
          Asian Fusion       0.42      0.39      0.40       253
               Burgers       0.67      0.72      0.69       667
        Canadian (New)       0.67      0.72      0.69       244
         Chicken Wings       0.53      0.34      0.41       342
               Chinese       0.81      0.84      0.82       536
                Indian       0.84      0.91      0.87       171
               Italian       0.67      0.54      0.60       581
              Japanese       0.58      0.72      0.64       328
         Mediterranean       0.61      0.62      0.61       208
               Mexican       0.90      0.95      0.93       555
        Middle Eastern       0.70      0.61      0.65       157
                 Pizza       0.65      0.80      0.72       801
            Sandwiches       0.75      0.74      0.75       894
               Seafood       0.51      0.54      0.52       276
           Steakhouses       0.42      0.47      0.45       187
            Sushi Bars       0.54      0.32      0.40       271
                  Thai       0.64      0.74      0.69       185
            Vietnamese       0.72      0.73      0.73       142

           avg / total       0.67      0.67      0.67      8231

array([[328,  23,   5,  53,  22,   2,   3,   0,  16,   1,  13,   6,   1,
          5,  22,   7,  19,   2,   1,   0],
       [ 92, 571,   5,  74,  21,   6,   6,   0,  19,   0,   4,   6,   0,
         28,  28,  11,  32,   1,   0,   0],
       [  4,   2,  98,   4,   2,   0,  46,   6,   7,  28,   0,   1,   0,
          0,   0,  11,   4,   9,  21,  10],
       [ 15,  74,   1, 483,   2,   2,   0,   1,   3,   1,   6,   3,   1,
         20,  35,   2,  16,   2,   0,   0],
       [  5,   6,   4,  21, 175,   0,   0,   2,   5,   1,   0,   1,   0,
          4,   1,  12,   4,   0,   0,   3],
       [ 14,  43,   3,  30,   0, 117,   1,   0,  32,   0,   0,   3,   1,
         61,  21,   7,   9,   0,   0,   0],
       [  0,   1,  20,   0,   3,   0, 449,   2,   0,  11,   0,   0,   0,
          0,   0,  18,   0,   8,  20,   4],
       [  1,   0,   4,   0,   0,   0,   4, 155,   0,   0,   1,   1,   1,
          0,   1,   0,   0,   0,   2,   1],
       [  6,   3,   0,   2,   8,  44,   1,   1, 313,   0,   6,   2,   1,
        140,  34,  11,   9,   0,   0,   0],
       [  1,   1,  28,   1,   0,   0,   8,   0,   3, 236,   0,   0,   0,
          0,   0,   0,   2,  38,   9,   1],
       [  3,   6,   1,   1,   6,   1,   0,   0,   8,   1, 128,   3,  26,
          7,  16,   0,   1,   0,   0,   0],
       [  8,   2,   5,   0,   0,   0,   1,   0,   0,   1,   0, 526,   0,
          1,   6,   5,   0,   0,   0,   0],
       [  2,   1,   1,   4,   0,   1,   0,   9,   0,   0,  38,   1,  95,
          2,   3,   0,   0,   0,   0,   0],
       [ 21,   3,   0,   5,   7,  33,   2,   3,  36,   0,   2,   1,   0,
        643,  38,   4,   2,   0,   0,   1],
       [ 19,  46,   1,  30,   6,  12,   0,   0,  19,   0,   4,   1,   8,
         69, 665,   3,   6,   0,   1,   4],
       [  7,  17,   7,   6,   0,   2,   4,   3,   6,  13,   5,  24,   2,
          9,   5, 148,  13,   2,   1,   2],
       [  3,  17,   1,   9,   6,   2,   1,   2,   0,   2,   1,   2,   0,
          3,   5,  36,  88,   9,   0,   0],
       [  1,   2,  35,   0,   1,   0,  10,   0,   0, 108,   0,   1,   0,
          1,   0,  11,   2,  88,   8,   3],
       [  0,   0,  11,   0,   1,   0,  10,   0,   1,   4,   1,   0,   0,
          0,   0,   5,   1,   4, 136,  11],
       [  1,   1,   6,   1,   0,   0,   7,   0,   0,   3,   0,   0,   0,
          0,   6,   0,   0,   1,  12, 104]])

Classifiers Function

This is a convenience function runs 11 most popular classifier and returns the results on the dataset.

In [942]:
import warnings; warnings.simplefilter('ignore')
import seaborn as sns

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

def run_classifiers(X, y, num_splits, rnd_state, __bar):

    seed = 1
    # prepare models
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('GNB', GaussianNB()))
    models.append(('MNB', MultinomialNB()))
    models.append(('BNB', BernoulliNB()))
    models.append(('LSVM', LinearSVC()))
    models.append(('SVM', SVC()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('BAG', BaggingClassifier()))

    # evaluate each model in turn
    results = []
    names = []
    scoring = 'accuracy'
    for name, model in models:
        kfold = model_selection.KFold(n_splits=num_splits, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        __bar.value += 1
    return results

from pylab import rcParams

def draw_confusion_matrix(category, y_test, y_pred):
    
    rcParams['figure.figsize'] = 20, 20
    faceLabels = ['Not {} (0)'.format(category),'{} (1)'.format(category)]
    mat = confusion_matrix(y_test, y_pred)
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
                xticklabels = faceLabels, cmap="BuPu", linecolor='black', linewidths=1,
                yticklabels = faceLabels)
    plt.xlabel('Actual')
    plt.ylabel('Predicted');
    plt.show()

def draw_confusion_matrix_all(categories, y_test, y_pred):
    
    rcParams['figure.figsize'] = 20, 20
    faceLabels = categories
    mat = confusion_matrix(y_test, y_pred)
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
                xticklabels = faceLabels, cmap="BuPu", linecolor='black', linewidths=1,
                yticklabels = faceLabels)
    plt.xlabel('Actual')
    plt.ylabel('Predicted');
    plt.show()
In [929]:
# data imports
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
df_restaurants_label_filtered = pd.read_pickle('df_restaurants_label_filtered.pkl')
top_20_specific_categories = pd.read_pickle('df_top_20_specific_categories.pkl')['categories'].values
display(df_clust_group_info.head(2))
display(df_restaurants_label_filtered.head(2))
display(top_20_specific_categories)
size latitude longitude city zip neighborhood reviews_count user_count Sandwiches Supply Sandwiches Demand ... Steakhouses Supply Steakhouses Demand Indian Supply Indian Demand Thai Supply Thai Demand Vietnamese Supply Vietnamese Demand Middle Eastern Supply Middle Eastern Demand
label
0 10 45.540408 -73.599135 Montréal H2G 1K7, H2G 1K8, H2G 1K9, H2S 1V1, H2S 1K7, H... Rosemont-La Petite-Patrie 144.0 142.0 1.0 5.380714 ... 0.0 0.877739 0.0 1.711929 1.0 0.661825 3.0 1.529594 0.0 1.940887
1 4 43.713006 -79.633204 Mississauga L4T 1A8 Ridgewood 46.0 44.0 0.0 1.284429 ... 0.0 0.256007 0.0 1.196288 1.0 0.709098 0.0 0.551915 0.0 1.160313

2 rows × 48 columns

business_id latitude longitude city neighborhood state postal_code stars categories Sandwiches ... Canadian (New) Asian Fusion Mediterranean Steakhouses Indian Thai Vietnamese Middle Eastern label count
2 O8S5hYJ1SMc8fA4QBtVujA 45.540503 -73.599300 Montréal Rosemont-La Petite-Patrie QC H2G 1K7 4.0 [Sandwiches] 1 ... 0 0 0 0 0 0 0 0 0 10
5343 ps03u_P469lpTqYHOedgUw 45.540476 -73.598844 Montréal Rosemont-La Petite-Patrie QC H2G 1K8 4.0 [] 0 ... 0 0 0 0 0 0 0 0 0 10

2 rows × 31 columns

array(['Sandwiches', 'American (Traditional)', 'Pizza', 'Burgers',
       'Italian', 'Mexican', 'Chinese', 'American (New)', 'Japanese',
       'Chicken Wings', 'Seafood', 'Sushi Bars', 'Canadian (New)',
       'Asian Fusion', 'Mediterranean', 'Steakhouses', 'Indian', 'Thai',
       'Vietnamese', 'Middle Eastern'], dtype=object)

Define 10 and 20 graphing limits and their equivalent colors on the maps

In [930]:
mapbox_access_token = 'pk.eyJ1IjoiZjhheml6IiwiYSI6ImNqb3plOWp6MjA0bXIzcnFxczZ1bjdrbmwifQ.5qd5W4B06UUZc20Jax12OA'

#interval_10 = pd.interval_range(start=4, periods=10, freq=4, closed='both').to_tuples()
limits_10 = [(4,10),(11,20),(21,30),(31,40),(41,50),(51,70),(71,100),(101,200),(201,400),(401,2000)]
colors_10 = ['#0000FF', '#008080',  '#FF0000', '#008000', '#808000', '#000080', '#C36900', \
          '#FF00FF', '#800080','#00FF00']

# interval_20 = pd.interval_range(start=4, periods=20, freq=2, closed='both').to_tuples()
limits_20 = [(4,5),(6,10),(11,15),(16,20),(21,25),(26,30),(31,35),(35,40),(41,45),(45,50),(51,60), \
             (61,70),(71,80),(81,100),(101,150),(151,200),(201,300),(301,400),(401,1000),(1001,2000)]

colors_20 = ['RGB(230,25,75)','RGB(60,180,75)','RGB(255,225,25)','RGB(67,99,216)','RGB(245,130,49)', \
             'RGB(145,30,180)','RGB(70,240,240)','RGB(240,50,230)','RGB(188,246,12)','RGB(250,190,190)', \
             'RGB(0,128,128)', 'RGB(230,190,255)','RGB(154,99,36)','RGB(255,250,200)','RGB(170,255,195)', \
             'RGB(255,216,177)','RGB(0,0,117)','RGB(128,128,128)','RGB(128,0,0)','RGB(128,128,0)']

For city level map, 10 interval limits and 10 colors are used to indicate each cluster's size and color

In [931]:
label_sizes = df_restaurants_label_filtered[['business_id','label']].groupby(['label']).agg(['count'])
label_sizes['business_id']['count'].nlargest(10)
Out[931]:
label
8      1749
31      617
148     458
4       408
6       319
47      269
283     254
293     254
182     230
251     218
Name: count, dtype: int64

Massaging of the data for display purposes only

In [935]:
df_clust_group_info.head()
df_clust_group_info['label'] = df_clust_group_info.index
for index,row in df_clust_group_info.iterrows():
    df_clust_group_info.at[index, 'neighborhood'] = (row['neighborhood'][:50] + (row['neighborhood'][:50] and '...'))
    df_clust_group_info.at[index, 'zip'] = (row['zip'][:50] + (row['zip'][:50] and '...'))
In [933]:
clusters = []
scale = 1

for i in range(len(limits_20)):
    lim = limits_20[i]
    df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
                                      & (df_clust_group_info['size'] <= lim[1]))]
    cluster = dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'City: ' + df_sub['city'] + \
        '<br>Neighborhood(s): ' + df_sub['neighborhood'] + \
        '<br> Zip/Postal Code(s):' + df_sub['zip'],
        sizemode = 'diameter',
        marker = dict( 
            size = [i*scale]*len(df_sub), 
            color = colors_20[i],
            line = dict(width = 2,color = 'black')
        ),
        name = '{0} - {1}'.format(lim[0],lim[1]) )
    clusters.append(cluster)

layout = dict(
        title = 'Yelp Reviewed Restaurants in North America',
        showlegend = True,
        geo = dict(
            scope='north america',
            projection=dict( type='albers usa canada' ),
            resolution= 50,
            lonaxis= {
                'range': [-150, -55]
            },
            lataxis= {
                'range': [30, 50]
            },
            center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',       
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 100)",
            countrycolor="rgba(255, 200, 255)"           
        ),  
    )
    
fig = dict( data=clusters, layout=layout )
display(HTML('<a id="north_america_clustered">North America Clustered Restaurants by Location (All Categories)</a>'))
iplot( fig, validate=False, filename='d3-bubble-map-populations' )
In [936]:
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')

clusters = []
scale = 3

for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
                                      & (df_clust_group_info['size'] <= lim[1]))]
    cluster = dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'City: ' + df_sub['city'] + \
        '<br>Size: ' + df_sub['size'].astype(str) + \
        '<br>Neighborhood: ' + df_sub['neighborhood'] + \
        '<br>Postal Code:' + df_sub['zip'],
        sizemode = 'diameter',
        marker = dict( 
            size = [i*scale]*len(df_sub),
            color = colors_10[i],
            line = dict(width = 2,color = 'black')
        ),
        name = '{0} - {1}'.format(lim[0],lim[1]) )
    clusters.append(cluster)

layout = dict(
        title = 'Yelp Reviewed Clustered Restaurants in Toronto',
        showlegend = True,
        geo = dict(
            scope='north america',
            projection=dict( type='albers usa canada', scale=500 ),
            resolution= 50,
            lonaxis= {
                'range': [-130, -55]
            },
            lataxis= {
                'range': [30, 50]
            },
                    center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',       
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(120, 120, 120)",
            countrycolor="rgb(255, 255, 255)"           
        ),  
    )
    
fig = dict( data=clusters, layout=layout )

display(HTML('<a id="toronto_clustered">All Clustered Restaurants on Sketch (Toronto)</a>'))

iplot( fig, validate=False, filename='d3-bubble-map-populations' )
In [937]:
clusters = []
scale = 4

for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
                                    & (df_clust_group_info['size'] <= lim[1]))]
    cluster = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        text = 'Cluster #: ' + df_sub.index.astype(str) + \
        '<br>Size: ' + df_sub['size'].astype(str) + \
        '<br>City: ' + df_sub['city'] + \
        '<br>Neighborhood: ' + df_sub['neighborhood'] + \
        '<br>Postal Code:' + df_sub['zip'],
        mode = 'markers',
        marker = dict( 
            size = [i*scale]*len(df_sub), 
            color = colors_10[i]
        ),
        name = '[{0} - {1}]'.format(lim[0],lim[1]) )
    border = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        mode='markers',
        marker=dict(
            size=[i * scale + 1]*len(df_sub),
            color='black',
            opacity=0.4
        ),
        hoverinfo='none',
        showlegend=False)
    clusters.append(border)
    clusters.append(cluster)
layout = go.Layout(
    title = 'Yelp Reviewed Clustered Restaurants on Toronto Map',
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
        pitch=0,
        zoom=12
    ),
    
)


fig = dict(data=clusters, layout=layout)

display(HTML('<a id="toronto_clustered_map">All Clustered Restaurants on Map (Toronto)</a>'))

iplot(fig, filename='Multiple Mapbox')
In [938]:
demand_cats = [x + ' Demand' for x in top_20_specific_categories]
supply_cats = [x + ' Supply' for x in top_20_specific_categories]
local_demand_cats = [x + ' Local Demand' for x in top_20_specific_categories]
display_cats = [x + ' Display' for x in top_20_specific_categories]


# Add new Local Demand columns for each category
df_clust_group_info[local_demand_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
df_clust_group_info[display_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])

scaler = MinMaxScaler(feature_range=(0, 1))

t = None
for index,row in df_clust_group_info.iterrows():
    cluster_supply = row[supply_cats].transpose().sum()
    cluster_demand = row[demand_cats].transpose().sum()
    cluster_adjustment_ratio = (cluster_supply / cluster_demand) if cluster_demand > 0 else 0
    for x in top_20_specific_categories:
        localDemand = round(row[x + ' Demand'] * cluster_adjustment_ratio)
        df_clust_group_info.at[index, x + ' Local Demand'] = localDemand
    # apply (n - min)/(max - min) formula to the difference of Local Demand and Supply to normalize display
    diff = row[supply_cats].values - df_clust_group_info.loc[index, local_demand_cats].values
    scaled = scaler.fit_transform(diff.astype('float64').reshape(-1,1))
    for i in range(len(diff)):
        df_clust_group_info.at[index, top_20_specific_categories[i] + ' Display'] = scaled[i]
            

df_clust_group_info[display_cats].head() 
        
Out[938]:
Sandwiches Display American (Traditional) Display Pizza Display Burgers Display Italian Display Mexican Display Chinese Display American (New) Display Japanese Display Chicken Wings Display Seafood Display Sushi Bars Display Canadian (New) Display Asian Fusion Display Mediterranean Display Steakhouses Display Indian Display Thai Display Vietnamese Display Middle Eastern Display
label
0 0.400000 0.400000 0.400000 0.200000 0.200000 0.400000 0.4 0.400000 0.60000 0.400000 0.400000 0.600000 0.00000 0.600000 0.400000 0.400000 0.400000 0.600000 1.000000 0.400000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.00000 1.000000 1.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
2 0.333333 0.666667 0.333333 0.666667 0.333333 0.333333 0.0 0.333333 1.00000 0.333333 0.666667 0.333333 0.00000 0.666667 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333
3 0.571429 0.685714 0.685714 0.571429 0.485714 0.714286 0.2 0.542857 1.00000 0.657143 0.542857 0.885714 0.00000 0.742857 0.657143 0.542857 0.542857 0.628571 0.542857 0.800000
4 0.718750 0.468750 0.687500 0.750000 0.718750 0.843750 0.0 0.531250 0.96875 0.593750 0.750000 0.843750 0.09375 0.843750 1.000000 0.781250 0.812500 0.843750 0.875000 0.625000
In [944]:
clusters = []
scale = 4
colors = ['maroon', 'purple', 'navy', 'teal', 'olive']
for x in range(0, len(top_20_specific_categories)):
    cat = top_20_specific_categories[x]
    for i in range(len(limits_10)):
        lim = limits_10[i]
        df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
                                    & (df_clust_group_info['size'] <= lim[1]))]
        
        demandStr, supplyStr = '{} Demand'.format(cat), '{} Supply'.format(cat)
        local_sd_ratio = df_sub[demandStr].max() / df_sub[supplyStr].max()
        
        cluster = go.Scattermapbox(
            lon = df_sub['longitude'],
            lat = df_sub['latitude'],
            text = 'Category: {}'.format(cat) + \
            '<br>Size: ' + df_sub['size'].astype(str) + \
            '<br>City: ' + df_sub['city'] + \
            '<br>Demand: ' + df_sub['{} Local Demand'.format(cat)].astype(str) + \
            '<br>Supply: ' + df_sub['{} Supply'.format(cat)].astype(str) + \
            '<br>Neighborhood: ' + df_sub['neighborhood'] + \
            '<br>Postal Code:' + df_sub['zip'],
            mode = 'markers',
            marker = dict( 
                size = [i*scale]*len(df_sub), 
                color = colors[x % 5],
                opacity = df_sub['{} Display'.format(cat)] 
            ),
            name = '[{0} - {1}]'.format(lim[0],lim[1]) ,
            visible= (False if x > 0 else True)
        )
        clusters.append(cluster)
        
# add border for all clusters
for i in range(len(limits_10)):
    lim = limits_10[i]
    df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
                                    & (df_clust_group_info['size'] <= lim[1]))]
    border = go.Scattermapbox(
        lon = df_sub['longitude'],
        lat = df_sub['latitude'],
        mode='markers',
        marker=dict(
            size=[i * scale + 1]*len(df_sub),
            color='black',
            opacity=0.1
        ),
        hoverinfo='none',
        visible=True,
        showlegend=False)
    clusters.append(border)

        
        
steps = []
trc_count = 10
traces_per_category =  10
category_size = len(top_20_specific_categories)
v = [False] * traces_per_category * category_size + [True] * traces_per_category

for i in range(0, category_size):
    step = dict(method='restyle',
                args=['visible', v[0:i * trc_count] + [True] * trc_count + v[ (i+1) * trc_count: len(v)]],
                label='{}'.format(top_20_specific_categories[i]))
    steps.append(step)

sliders = [dict(active=0,
                pad={"t": 1},
                steps=steps)]  
        
layout = go.Layout(
    title = 'Yelp Reviewed Restaurants Supply/Demand by Category Slider',
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=43.6543,
            lon=-79.3860
        ),
        pitch=0,
        zoom=12
    ),
    sliders = sliders
)


fig = dict(data=clusters, layout=layout)



display(HTML('<a id="toronto_clustered_categorized">Slider Controlled Categories Displaying Demand (Toronto)</a>'))

iplot(fig, filename='Multiple Mapbox')